49 problems from 47 repositories selected within the current time window.
You can adjust the time window by modifying the problems' release start and end dates. Evaluations highlighted in red may be contaminated: they include tasks that were created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model will appear at the bottom of the leaderboard with N/A values.
Insights
September 2025
- Claude Sonnet 4.5 demonstrated notably strong generalization and problem coverage, achieving the highest pass@5 (55.1%) and solving several instances that no other model on the leaderboard resolved: python-trio/trio-3334, cubed-dev/cubed-799, canopen-python/canopen-613.
- Grok Code Fast 1 and gpt-oss-120b stand out as ultra-efficient budget options, delivering a resolved rate of around 29–30% for only $0.03–$0.04 per problem.
- We observed that Anthropic models (e.g., Claude Sonnet 4) do not use caching by default, unlike other frontier models. Proper use of caching dramatically reduces inference costs: for instance, the average per-problem cost for Claude Sonnet 4 dropped from $5.29 in our August release to just $0.91 in September. All Anthropic models in the current September release were evaluated with caching enabled, so their cost figures are now directly comparable to other frontier models.
- All models on the leaderboard were evaluated using the ChatCompletions API, except for gpt-5-codex and gpt-oss-120b, which are accessible only via the Responses API. The Responses API natively supports reasoning models and allows linking to previous responses through unique references. This mechanism lets the model reuse its internal reasoning context from earlier steps, a feature that turned out to be beneficial for agentic systems that require multi-step reasoning continuity.
- We also evaluated gpt-5-medium with reasoning-context reuse enabled via the Responses API, where it achieved a resolved rate of 41.2% and a pass@5 of 51%. However, to maintain fairness, we excluded these results from the leaderboard, since other reasoning-capable models currently do not have reasoning-context reuse enabled within our evaluation framework. We plan to evaluate all frontier models with reasoning context preserved from earlier steps to see how their performance changes.
- In our evaluation, we observed that gpt-5-high performed worse than gpt-5-medium. We initially attributed this to the agent's maximum step limit, theorizing that gpt-5-high requires more steps to run tests and check corner cases. However, doubling the max_step_limit from its default of 80 to 160 yielded only a slight performance increase (pass@1: 36.3% -> 38.3%, pass@5: 46.9% -> 48.9%). An alternative hypothesis, which we will validate shortly, is that gpt-5-high particularly benefits from reusing its previous reasoning steps.
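For the caching point above: with Anthropic's Messages API, prompt caching is opt-in per request, enabled by marking the stable prefix (e.g., the agent's system prompt) with a `cache_control` block. A minimal sketch of the request payload follows; the model id and prompt text are illustrative, and no API call is made:

```python
# Sketch: enabling Anthropic prompt caching by marking the stable system
# prompt as a cacheable prefix. Builds the payload only -- no network call.
# The model id and prompt strings are placeholders, not our exact setup.

def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable,
    so repeated agent steps sharing that prefix read it from the cache."""
    return {
        "model": "claude-sonnet-4",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the cacheable prefix; subsequent requests
                # that share this prefix are billed at the cheaper cache-read
                # rate instead of the full input-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_cached_request(
    "You are a software engineering agent.", "Fix the failing test."
)
```

Because an agentic run resends a long shared prefix (system prompt plus tool definitions) at every step, cache reads priced at a fraction of regular input tokens are what drive cost drops of the magnitude described above.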

August 2025
- Kimi K2 Instruct 0905 improved significantly (resolved rate up from 34.6% to 42.3%) and is now among the top 3 open-source models.
- DeepSeek V3.1 also improved, though less dramatically; at the same time, the number of tokens it produces has grown almost 4x.
- Qwen3-Next-80B-A3B-Instruct, despite not being trained specifically for coding, performs on par with the 30B Coder. To reflect model speed, we're also considering how best to report efficiency metrics such as tokens/sec on the leaderboard.
- Finally, Grok 4: xAI's frontier model has now entered the leaderboard and is among the top performers.

| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 44.5% | 1.00% | 55.1% | $0.89 | 1,797,999 |
| 2 | gpt-5-codex | 41.2% | 0.76% | 44.9% | $0.86 | 1,658,941 |
| 3 | Claude Sonnet 4 | 40.6% | 1.08% | 46.9% | $0.91 | 1,915,332 |
| 4 | Claude Opus 4.1 | 40.2% | 0.77% | 44.9% | $4.03 | 1,675,141 |
| 5 | gpt-5-2025-08-07-medium | 38.8% | 1.29% | 44.9% | $0.72 | 1,219,914 |
| 6 | gpt-5-mini-2025-08-07-medium | 37.1% | 1.19% | 44.9% | $0.32 | 1,039,942 |
| 7 | GLM-4.6 | 37.0% | 1.00% | 42.9% | $0.77 | 1,677,742 |
| 8 | gpt-5-2025-08-07-high | 36.3% | 2.08% | 46.9% | $1.05 | 1,641,219 |
| 9 | o3-2025-04-16 | 36.3% | 1.98% | 46.9% | $1.33 | 1,404,415 |
| 10 | Qwen3-Coder-480B-A35B-Instruct | 35.7% | 1.51% | 44.9% | $0.59 | 1,466,625 |
| 11 | GLM-4.5 | 35.1% | 1.35% | 44.9% | $0.92 | 1,518,166 |
| 12 | Grok 4 | 34.6% | 2.07% | 44.9% | $1.53 | 1,168,808 |
| 13 | GLM-4.5 Air | 31.0% | 2.18% | 42.9% | $0.32 | 1,578,223 |
| 14 | gpt-5-2025-08-07-minimal | 30.6% | 0.65% | 46.9% | $0.32 | 629,319 |
| 15 | Grok Code Fast 1 | 30.1% | 2.11% | 42.9% | $0.04 | 957,736 |
| 16 | gpt-oss-120b | 28.7% | 1.30% | 42.9% | $0.04 | 1,161,946 |
| 17 | Qwen3-235B-A22B-Instruct-2507 | 28.6% | 1.83% | 40.8% | $0.18 | 899,731 |
| 18 | gpt-4.1-2025-04-14 | 28.4% | 1.85% | 42.9% | $0.48 | 518,584 |
| 19 | o4-mini-2025-04-16 | 27.3% | 1.53% | 44.9% | $0.95 | 1,726,082 |
| 20 | Kimi K2 Instruct 0905 | 25.9% | 2.02% | 40.8% | $1.10 | 1,815,589 |
| 21 | DeepSeek-V3.1 | 24.9% | 2.47% | 42.9% | $0.41 | 1,509,692 |
| 22 | Qwen3-Coder-30B-A3B-Instruct | 23.3% | 1.38% | 32.7% | $0.06 | 584,337 |
| 23 | Qwen3-235B-A22B-Thinking-2507 | 22.4% | 1.29% | 32.7% | $0.14 | 512,537 |
| 24 | DeepSeek-V3-0324 | 22.1% | 0.72% | 30.6% | $0.17 | 324,623 |
| 25 | gemini-2.5-pro | 21.4% | 1.07% | 34.7% | $0.59 | 1,111,184 |
| 26 | DeepSeek-R1-0528 | 20.0% | 2.08% | 30.6% | $0.63 | 679,114 |
| 27 | Qwen3-Next-80B-A3B-Instruct | 19.7% | 1.35% | 32.7% | $0.23 | 444,208 |
| 28 | gemini-2.5-flash | 17.6% | 0.99% | 30.6% | $0.17 | 1,384,944 |
| 29 | gpt-4.1-mini-2025-04-14 | 15.9% | 2.36% | 36.7% | $0.21 | 1,217,617 |
| 30 | Qwen3-30B-A3B-Thinking-2507 | 13.1% | 1.38% | 26.5% | $0.05 | 436,857 |
| 31 | Qwen3-30B-A3B-Instruct-2507 | 10.2% | 1.44% | 26.5% | $0.12 | 1,128,517 |
| 32 | Claude Sonnet 3.5 | N/A | N/A | N/A | N/A | N/A |
| 33 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A |
| 34 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A |
| 35 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A |
| 36 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 37 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 38 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 39 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 40 | gemma-3-27b-it | N/A | N/A | N/A | N/A | N/A |
| 41 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 42 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 43 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 44 | gpt-oss-20b | N/A | N/A | N/A | N/A | N/A |
| 45 | horizon-alpha | N/A | N/A | N/A | N/A | N/A |
| 46 | horizon-beta | N/A | N/A | N/A | N/A | N/A |
| 47 | Kimi K2 | N/A | N/A | N/A | N/A | N/A |
| 48 | Llama-3.3-70B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 49 | Llama-4-Maverick-17B-128E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 50 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 51 | Qwen2.5-72B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 52 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 53 | Qwen3-235B-A22B | N/A | N/A | N/A | N/A | N/A |
| 54 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 55 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A |
| 56 | Qwen3-32B | N/A | N/A | N/A | N/A | N/A |
| 57 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 58 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A |
News
- [2025-10-28]:
- Added new model to the leaderboard: GLM-4.6.
- [2025-10-09]:
- Added new models to the leaderboard: Claude Sonnet 4.5, gpt-5-codex, Claude Opus 4.1, Qwen3-30B-A3B-Thinking-2507 and Qwen3-30B-A3B-Instruct-2507.
- Added a new Insights section providing analysis and key takeaways from recent model and data releases.
- Deprecated the following models:
- Text: Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, gemma-3-27b-it and Qwen2.5-72B-Instruct.
- Tools: Claude Sonnet 3.5, Kimi K2, gemini-2.0-flash, Qwen3-235B-A22B and Qwen3-32B.
- [2025-09-17]:
- Added new models to the leaderboard: Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1 and Qwen3-Next-80B-A3B-Instruct.
- [2025-09-04]:
- Added new models to the leaderboard: GLM-4.5, GLM-4.5 Air, Grok Code Fast 1, Kimi K2, gpt-5-mini-2025-08-07-medium, gpt-oss-120b and gpt-oss-20b.
- Introduced Cost per Problem and Tokens per Problem columns.
- Added links to the pull requests within the selected time window. You can review them via the Inspect button.
- Deprecated the following models:
- Text: DeepSeek-V3, DeepSeek-V3-0324, Devstral-Small-2505, gemini-2.0-flash, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, Llama-4-Scout-17B-16E-Instruct and Qwen2.5-Coder-32B-Instruct.
- Tools: horizon-alpha and horizon-beta.
- [2025-08-12]: Added new models to the leaderboard: gpt-5-medium-2025-08-07, gpt-5-high-2025-08-07 and gpt-5-minimal-2025-08-07.
- [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct and horizon-beta.
- [2025-07-31]:
- Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
- Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
- Updated demo format: tool calls are now shown as distinct assistant and tool messages.
- [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard.
- [2025-07-10]: Added models performance chart and evaluations on June data.
- [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
- [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
- [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.