49 problems from 47 repositories selected within the current time window.
You can adjust the time window by modifying the problems' release start and end dates. Evaluations highlighted in red may be contaminated: they include tasks that were created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model will appear at the bottom of the leaderboard with N/A values.
Insights
September 2025
- Claude Sonnet 4.5 demonstrated notably strong generalization and problem coverage, achieving the highest pass@5 (55.1%) and solving several instances that no other model on the leaderboard resolved: python-trio/trio-3334, cubed-dev/cubed-799, canopen-python/canopen-613.
- Grok Code Fast 1 and gpt-oss-120b stand out as ultra-efficient budget options, delivering a resolved rate of around 29–30% for only $0.03–$0.04 per problem.
- We observed that Anthropic models (e.g., Claude Sonnet 4) do not use caching by default, unlike other frontier models. Proper use of caching dramatically reduces inference costs: for instance, the average per-problem cost for Claude Sonnet 4 dropped from $5.29 in our August release to just $0.91 in September. All Anthropic models in the current September release were evaluated with caching enabled, so their cost figures are now directly comparable to other frontier models.
- All models on the leaderboard were evaluated using the ChatCompletions API, except for gpt-5-codex and gpt-oss-120b, which are accessible only via the Responses API. The Responses API natively supports reasoning models and allows linking to previous responses through unique references. This mechanism lets the model reuse its internal reasoning context from earlier steps, a feature that turned out to be beneficial for agentic systems that require multi-step reasoning continuity.
- We also evaluated gpt-5-medium with reasoning-context reuse enabled via the Responses API, where it achieved a resolved rate of 41.2% and a pass@5 of 51%. However, to maintain fairness, we excluded these results from the leaderboard, since other reasoning-capable models currently do not have reasoning-context reuse enabled within our evaluation framework. We plan to evaluate all frontier models with reasoning context preserved from earlier steps to see how their performance changes.
- In our evaluation, we observed that gpt-5-high performed worse than gpt-5-medium. We initially attributed this to the agent's maximum step limit, theorizing that gpt-5-high requires more steps to run tests and check corner cases. However, doubling the max_step_limit from its default of 80 to 160 yielded only a slight performance increase (pass@1: 36.3% -> 38.3%, pass@5: 46.9% -> 48.9%). An alternative hypothesis, which we will validate shortly, is that gpt-5-high particularly benefits from reusing its previous reasoning steps.
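For the caching point above: with Anthropic's Messages API, prompt caching is opt-in per request, enabled by marking the stable prefix (e.g., the agent's system prompt) with a `cache_control` block. A minimal sketch of the request payload follows; the model id and prompt text are illustrative, and no API call is made:

```python
# Sketch: enabling Anthropic prompt caching by marking the stable system
# prompt as a cacheable prefix. Builds the payload only -- no network call.
# The model id and prompt strings are placeholders, not our exact setup.

def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable,
    so repeated agent steps sharing that prefix read it from the cache."""
    return {
        "model": "claude-sonnet-4",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the cacheable prefix; subsequent requests
                # that share this prefix are billed at the cheaper cache-read
                # rate instead of the full input-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_cached_request(
    "You are a software engineering agent.", "Fix the failing test."
)
```

Because an agentic run resends a long shared prefix (system prompt plus tool definitions) at every step, cache reads priced at a fraction of regular input tokens are what drive cost drops of the magnitude described above.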

August 2025
- Kimi K2 Instruct 0905 improved significantly (resolved rate up from 34.6% to 42.3%) and is now among the top 3 open-source models.
- DeepSeek V3.1 also improved, though less dramatically; at the same time, the number of tokens it produces has grown almost 4x.
- Qwen3-Next-80B-A3B-Instruct, despite not being trained specifically for coding, performs on par with the 30B Coder. To reflect model speed, we're also considering how best to report efficiency metrics such as tokens/sec on the leaderboard.
- Finally, Grok 4: xAI's frontier model has now entered the leaderboard and is among the top performers.

| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 44.5% | 1.00% | 55.1% | $0.89 | 1,797,999 |
| 2 | gpt-5-codex | 41.2% | 0.76% | 44.9% | $0.86 | 1,658,941 |
| 3 | Claude Sonnet 4 | 40.6% | 1.08% | 46.9% | $0.91 | 1,915,332 |
| 4 | Claude Opus 4.1 | 40.2% | 0.77% | 44.9% | $4.03 | 1,675,141 |
| 5 | gpt-5-2025-08-07-medium | 38.8% | 1.29% | 44.9% | $0.72 | 1,219,914 |
| 6 | gpt-5-mini-2025-08-07-medium | 37.1% | 1.19% | 44.9% | $0.32 | 1,039,942 |
| 7 | GLM-4.6 | 37.0% | 1.00% | 42.9% | $0.77 | 1,677,742 |
| 8 | gpt-5-2025-08-07-high | 36.3% | 2.08% | 46.9% | $1.05 | 1,641,219 |
| 9 | o3-2025-04-16 | 36.3% | 1.98% | 46.9% | $1.33 | 1,404,415 |
| 10 | Qwen3-Coder-480B-A35B-Instruct | 35.7% | 1.51% | 44.9% | $0.59 | 1,466,625 |
| 11 | GLM-4.5 | 35.1% | 1.35% | 44.9% | $0.92 | 1,518,166 |
| 12 | Grok 4 | 34.6% | 2.07% | 44.9% | $1.53 | 1,168,808 |
| 13 | GLM-4.5 Air | 31.0% | 2.18% | 42.9% | $0.32 | 1,578,223 |
| 14 | gpt-5-2025-08-07-minimal | 30.6% | 0.65% | 46.9% | $0.32 | 629,319 |
| 15 | Grok Code Fast 1 | 30.1% | 2.11% | 42.9% | $0.04 | 957,736 |
| 16 | gpt-oss-120b | 28.7% | 1.30% | 42.9% | $0.04 | 1,161,946 |
| 17 | Qwen3-235B-A22B-Instruct-2507 | 28.6% | 1.83% | 40.8% | $0.18 | 899,731 |
| 18 | gpt-4.1-2025-04-14 | 28.4% | 1.85% | 42.9% | $0.48 | 518,584 |
| 19 | o4-mini-2025-04-16 | 27.3% | 1.53% | 44.9% | $0.95 | 1,726,082 |
| 20 | Kimi K2 Instruct 0905 | 25.9% | 2.02% | 40.8% | $1.10 | 1,815,589 |
| 21 | DeepSeek-V3.1 | 24.9% | 2.47% | 42.9% | $0.41 | 1,509,692 |
| 22 | Qwen3-Coder-30B-A3B-Instruct | 23.3% | 1.38% | 32.7% | $0.06 | 584,337 |
| 23 | Qwen3-235B-A22B-Thinking-2507 | 22.4% | 1.29% | 32.7% | $0.14 | 512,537 |
| 24 | DeepSeek-V3-0324 | 22.1% | 0.72% | 30.6% | $0.17 | 324,623 |
| 25 | gemini-2.5-pro | 21.4% | 1.07% | 34.7% | $0.59 | 1,111,184 |
| 26 | DeepSeek-R1-0528 | 20.0% | 2.08% | 30.6% | $0.63 | 679,114 |
| 27 | Qwen3-Next-80B-A3B-Instruct | 19.7% | 1.35% | 32.7% | $0.23 | 444,208 |
| 28 | gemini-2.5-flash | 17.6% | 0.99% | 30.6% | $0.17 | 1,384,944 |
| 29 | gpt-4.1-mini-2025-04-14 | 15.9% | 2.36% | 36.7% | $0.21 | 1,217,617 |
| 30 | Qwen3-30B-A3B-Thinking-2507 | 13.1% | 1.38% | 26.5% | $0.05 | 436,857 |
| 31 | Qwen3-30B-A3B-Instruct-2507 | 10.2% | 1.44% | 26.5% | $0.12 | 1,128,517 |
| 32 | Claude Sonnet 3.5 | N/A | N/A | N/A | N/A | N/A |
| 33 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A |
| 34 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A |
| 35 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A |
| 36 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 37 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 38 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 39 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 40 | gemma-3-27b-it | N/A | N/A | N/A | N/A | N/A |
| 41 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 42 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 43 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 44 | gpt-oss-20b | N/A | N/A | N/A | N/A | N/A |
| 45 | horizon-alpha | N/A | N/A | N/A | N/A | N/A |
| 46 | horizon-beta | N/A | N/A | N/A | N/A | N/A |
| 47 | Kimi K2 | N/A | N/A | N/A | N/A | N/A |
| 48 | Llama-3.3-70B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 49 | Llama-4-Maverick-17B-128E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 50 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 51 | Qwen2.5-72B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 52 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 53 | Qwen3-235B-A22B | N/A | N/A | N/A | N/A | N/A |
| 54 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 55 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A |
| 56 | Qwen3-32B | N/A | N/A | N/A | N/A | N/A |
| 57 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 58 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A |
News
- [2025-10-28]:
- Added new model to the leaderboard: GLM-4.6.
- [2025-10-09]:
- Added new models to the leaderboard: Claude Sonnet 4.5, gpt-5-codex, Claude Opus 4.1, Qwen3-30B-A3B-Thinking-2507 and Qwen3-30B-A3B-Instruct-2507.
- Added a new Insights section providing analysis and key takeaways from recent model and data releases.
- Deprecated the following models:
- Text: Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, gemma-3-27b-it and Qwen2.5-72B-Instruct.
- Tools: Claude Sonnet 3.5, Kimi K2, gemini-2.0-flash, Qwen3-235B-A22B and Qwen3-32B.
- [2025-09-17]:
- Added new models to the leaderboard: Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1 and Qwen3-Next-80B-A3B-Instruct.
- [2025-09-04]:
- Added new models to the leaderboard: GLM-4.5, GLM-4.5 Air, Grok Code Fast 1, Kimi K2, gpt-5-mini-2025-08-07-medium, gpt-oss-120b and gpt-oss-20b.
- Introduced Cost per Problem and Tokens per Problem columns.
- Added links to the pull requests within the selected time window. You can review them via the Inspect button.
- Deprecated the following models:
- Text: DeepSeek-V3, DeepSeek-V3-0324, Devstral-Small-2505, gemini-2.0-flash, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, Llama-4-Scout-17B-16E-Instruct and Qwen2.5-Coder-32B-Instruct.
- Tools: horizon-alpha and horizon-beta.
- [2025-08-12]: Added new models to the leaderboard: gpt-5-medium-2025-08-07, gpt-5-high-2025-08-07 and gpt-5-minimal-2025-08-07.
- [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct and horizon-beta.
- [2025-07-31]:
- Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
- Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
- Updated demo format: tool calls are now shown as distinct assistant and tool messages.
- [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard.
- [2025-07-10]: Added models performance chart and evaluations on June data.
- [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
- [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
- [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.