Time window: 01/10/2025 – 01/11/2025
51 problems from 34 repositories are selected within the current time window.
You can adjust the time window by changing the problems' release start and end dates. Evaluations highlighted in red may be contaminated: they include tasks created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model appears at the bottom of the leaderboard with N/A values.
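
A minimal sketch of this selection logic, assuming a hypothetical schema (`created_at` per task, `release_date` per model) and made-up dates; the real leaderboard code may differ:

```python
from datetime import date

# Hypothetical task records; ids mirror leaderboard instances, dates are made up.
problems = [
    {"id": "python-trio/trio-3334", "created_at": date(2025, 10, 14)},
    {"id": "cubed-dev/cubed-799", "created_at": date(2025, 10, 21)},
]

def select_window(problems: list[dict], start: date, end: date) -> list[dict]:
    """Keep only tasks whose PR creation date falls inside the time window."""
    return [p for p in problems if start <= p["created_at"] <= end]

def is_contaminated(selected: list[dict], model_release: date) -> bool:
    """Flag (highlight in red) an evaluation if any selected task was
    created before the model's release date."""
    return any(p["created_at"] < model_release for p in selected)

window = select_window(problems, start=date(2025, 10, 1), end=date(2025, 11, 1))
print(is_contaminated(window, model_release=date(2025, 10, 20)))  # True
```
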
Insights
October 2025
  • Claude Sonnet 4.5 delivers the strongest pass@1 and pass@5 results, yet it uses more tokens per problem than gpt-5-medium and gpt-5-high despite not running in an explicit reasoning mode. This suggests that Sonnet 4.5 adapts its reasoning depth internally and spends its token budget efficiently.
  • GPT-5 variants differ in how often they invoke reasoning: gpt-5-medium uses it in ~58% of steps (avg. 714 tokens), while gpt-5-high increases this to ~62% (avg. 1053 tokens). However, this additional reasoning does not translate into better task-solving ability in our setup: gpt-5-medium achieves a pass@5 of 49.0%, compared to 47.1% for gpt-5-high.
  • MiniMax M2 is the most cost-efficient open-source model among the top performers. Its pricing is $0.255 / $1.02 per 1M input/output tokens, whereas gpt-5-codex costs $1.25 / $10.00, with cached input available at just $0.125. In agentic workflows, where large trajectory prefixes are reused, cheap cache reads can make a model more economical even when its raw input/output prices are higher: thanks to caching, gpt-5-codex ends up at roughly the same Cost per Problem as MiniMax M2 ($0.51 vs. $0.44) while being considerably stronger (see the cost sketch after this list).
  • GLM-4.6 reaches the agent's maximum step limit (80 steps in our setup) roughly twice as often as GLM-4.5. This suggests its performance may be constrained by the step budget, and increasing the limit could improve its resolved rate.
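
A quick sketch of the cache-aware cost arithmetic behind that comparison; the per-1M-token prices are those quoted above, while the token split is illustrative rather than a measured trajectory:

```python
def cost_per_problem(input_tokens: int, cached_input_tokens: int,
                     output_tokens: int, price_in: float,
                     price_cached: float, price_out: float) -> float:
    """Cache-aware cost in dollars; all prices are per 1M tokens."""
    return (input_tokens * price_in
            + cached_input_tokens * price_cached
            + output_tokens * price_out) / 1_000_000

# gpt-5-codex list prices: $1.25 input, $0.125 cached input, $10.00 output.
# In an agentic loop most input tokens are a re-read trajectory prefix,
# so the cheap cache rate dominates (the split below is hypothetical).
print(cost_per_problem(100_000, 1_500_000, 20_000,
                       price_in=1.25, price_cached=0.125, price_out=10.00))
# 0.5125 -> roughly the $0.51 Cost per Problem reported despite high list prices.
```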

Chart

September 2025
  • Claude Sonnet 4.5 demonstrated notably strong generalization and problem coverage, achieving the highest pass@5 (55.1%) and solving several instances that no other model on the leaderboard resolved: python-trio/trio-3334, cubed-dev/cubed-799, canopen-python/canopen-613.
  • Grok Code Fast 1 and gpt-oss-120b stand out as ultra-efficient budget options, delivering around 29%–30% resolved rate for only $0.03–$0.04 per problem.
  • We observed that Anthropic models (e.g., Claude Sonnet 4) do not use caching by default, unlike other frontier models. Proper use of caching dramatically reduces inference costs – for instance, the average per-problem cost for Claude Sonnet 4 dropped from $5.29 in our August release to just $0.91 in September. All Anthropic models in the current September release were evaluated with caching enabled, so their cost figures are now directly comparable to those of other frontier models (a minimal caching sketch follows this list).
  • All models on the leaderboard were evaluated using the ChatCompletions API, except for gpt-5-codex and gpt-oss-120b, which are only accessible via the Responses API. The Responses API natively supports reasoning models and allows linking to previous responses through unique references, letting the model reuse its internal reasoning context from earlier steps – a feature that turned out to be beneficial for agentic systems requiring multi-step reasoning continuity (sketched after this list).
  • We also evaluated gpt-5-medium with reasoning-context reuse enabled via the Responses API, where it achieved a resolved rate of 41.2% and a pass@5 of 51%. However, to keep the comparison fair, we excluded these results from the leaderboard, since other reasoning-capable models do not currently have reasoning-context reuse enabled in our evaluation framework. We plan to evaluate all frontier models with reasoning context preserved across steps to see how their performance changes.
  • In our evaluation, gpt-5-high performed worse than gpt-5-medium. We initially attributed this to the agent's maximum step limit, theorizing that gpt-5-high needs more steps to run tests and check corner cases. However, doubling max_step_limit from its default of 80 to 160 yielded only a slight improvement (pass@1: 36.3% -> 38.3%, pass@5: 46.9% -> 48.9%). An alternative hypothesis, which we will validate shortly, is that gpt-5-high particularly benefits from reusing its earlier reasoning steps.
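
A minimal sketch of how prompt caching is enabled for Anthropic models via the Messages API's `cache_control` field; the model id, prompt, and token budget are placeholders, not our actual harness configuration:

```python
import anthropic

client = anthropic.Anthropic()

# The large, stable prefix (system prompt, tool definitions) is what benefits
# from caching in agentic loops. Placeholder prompt; note that caching only
# kicks in once the prefix exceeds a model-specific minimum length.
SYSTEM_PROMPT = "You are a coding agent. <long instructions and tool specs>"

response = client.messages.create(
    model="claude-sonnet-4-5",  # indicative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Opt-in: mark this block as a cacheable prefix for later calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Fix the failing test."}],
)
print(response.content[0].text)
```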

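And a sketch of reasoning-context reuse through the Responses API, where `previous_response_id` links a call to the prior step so the model can carry its internal reasoning forward; prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# First agent step.
first = client.responses.create(
    model="gpt-5-codex",
    input="Reproduce the failing test and propose a fix.",
)

# Follow-up step: linking to the previous response lets the model reuse the
# reasoning context it built earlier, instead of starting from scratch.
follow_up = client.responses.create(
    model="gpt-5-codex",
    previous_response_id=first.id,
    input="Apply the fix and re-run the test suite.",
)
print(follow_up.output_text)
```
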
Chart

August 2025
  • Kimi K2 Instruct 0905 improved significantly (resolved rate up from 34.6% to 42.3%) and is now in the top 3 open-source models.
  • DeepSeek-V3.1 also improved, though less dramatically. At the same time, the number of tokens it produces has grown almost 4x.
  • Qwen3-Next-80B-A3B-Instruct, despite not being trained specifically for coding, performs on par with Qwen3-Coder-30B-A3B-Instruct. To reflect model speed, we are also considering how best to report efficiency metrics such as tokens/sec on the leaderboard.
  • Finally, Grok 4: the frontier model from xAI has now entered the leaderboard and is among the top performers.

Chart

| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|------|-------|-------------------|-----------------------|------------|----------------------|--------------------|
| 1 | Claude Sonnet 4.5 | 44.3 | 0.48 | 56.9 | 0.98 | 2,002,524 |
| 2 | gpt-5-2025-08-07-medium | 42.4 | 0.78 | 49.0 | 0.66 | 1,460,199 |
| 3 | gpt-5-codex | 40.8 | 1.44 | 47.1 | 0.51 | 1,722,212 |
| 4 | Claude Opus 4.1 | 38.8 | 1.69 | 49.0 | 4.41 | 1,849,354 |
| 5 | gpt-5-2025-08-07-high | 38.8 | 1.30 | 47.1 | 1.00 | 1,968,033 |
| 6 | Claude Sonnet 4 | 37.3 | 0.62 | 47.1 | 0.96 | 2,024,614 |
| 7 | GLM-4.5 | 34.5 | 1.82 | 47.1 | 1.03 | 1,704,653 |
| 8 | MiniMax M2 | 32.5 | 1.59 | 51.0 | 0.44 | 1,680,096 |
| 9 | gpt-5-mini-2025-08-07-medium | 32.2 | 3.08 | 45.1 | 0.28 | 1,152,682 |
| 10 | Qwen3-Coder-480B-A35B-Instruct | 31.4 | 0.62 | 41.2 | 0.61 | 1,492,492 |
| 11 | GLM-4.6 | 30.2 | 0.48 | 45.1 | 1.06 | 1,776,991 |
| 12 | Kimi K2 Instruct 0905 | 29.8 | 1.90 | 47.1 | 1.14 | 1,878,292 |
| 13 | o3-2025-04-16 | 28.2 | 1.59 | 37.3 | 1.36 | 1,613,760 |
| 14 | GLM-4.5 Air | 26.7 | 1.82 | 45.1 | 0.33 | 1,590,930 |
| 15 | DeepSeek-V3.1 | 26.3 | 1.92 | 35.3 | 0.34 | 1,235,016 |
| 16 | gpt-5-2025-08-07-minimal | 26.0 | 1.38 | 35.3 | 0.27 | 690,504 |
| 17 | gemini-2.5-pro | 23.9 | 2.00 | 39.2 | 0.60 | 1,012,801 |
| 18 | gpt-oss-120b | 23.5 | 0.62 | 39.2 | 0.05 | 1,562,608 |
| 19 | gpt-4.1-2025-04-14 | 22.8 | 2.15 | 37.3 | 0.43 | 518,678 |
| 20 | Qwen3-235B-A22B-Instruct-2507 | 22.7 | 0.78 | 41.2 | 0.21 | 1,060,184 |
| 21 | Qwen3-235B-A22B-Thinking-2507 | 21.6 | 2.84 | 33.3 | 0.17 | 637,147 |
| 22 | o4-mini-2025-04-16 | 20.0 | 1.14 | 31.4 | 0.92 | 1,912,669 |
| 23 | gemini-2.5-flash | 19.2 | 1.14 | 39.2 | 0.12 | 1,145,232 |
| 24 | Qwen3-Next-80B-A3B-Instruct | 17.6 | 0.62 | 37.3 | 0.34 | 665,861 |
| 25 | Qwen3-Coder-30B-A3B-Instruct | 17.3 | 1.14 | 27.5 | 0.09 | 894,745 |
| 26 | DeepSeek-R1-0528 | 17.0 | 1.00 | 29.4 | 0.45 | 484,003 |
| 27 | DeepSeek-V3-0324 | 16.9 | 1.34 | 25.5 | 0.20 | 390,972 |
| 28 | gpt-4.1-mini-2025-04-14 | 15.4 | 1.22 | 27.5 | 0.18 | 1,144,123 |
| 29 | Qwen3-30B-A3B-Thinking-2507 | 10.6 | 1.18 | 21.6 | 0.05 | 398,292 |
| 30 | Qwen3-30B-A3B-Instruct-2507 | 9.8 | 1.07 | 23.5 | 0.06 | 613,885 |
| 31 | Claude Sonnet 3.5 | N/A | N/A | N/A | N/A | N/A |
| 32 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A |
| 33 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A |
| 34 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A |
| 35 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 36 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 37 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 38 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 39 | gemma-3-27b-it | N/A | N/A | N/A | N/A | N/A |
| 40 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 41 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 42 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 43 | gpt-oss-20b | N/A | N/A | N/A | N/A | N/A |
| 44 | Grok 4 | N/A | N/A | N/A | N/A | N/A |
| 45 | Grok Code Fast 1 | N/A | N/A | N/A | N/A | N/A |
| 46 | horizon-alpha | N/A | N/A | N/A | N/A | N/A |
| 47 | horizon-beta | N/A | N/A | N/A | N/A | N/A |
| 48 | Kimi K2 | N/A | N/A | N/A | N/A | N/A |
| 49 | Llama-3.3-70B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 50 | Llama-4-Maverick-17B-128E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 51 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 52 | Qwen2.5-72B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 53 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 54 | Qwen3-235B-A22B | N/A | N/A | N/A | N/A | N/A |
| 55 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 56 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A |
| 57 | Qwen3-32B | N/A | N/A | N/A | N/A | N/A |
| 58 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 59 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A |
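
For reference, the Pass@5 and SEM columns can be estimated as sketched below. This assumes the standard unbiased pass@k estimator (Chen et al., 2021) and a run-level standard error of the resolved rate; the leaderboard's exact methodology may differ:

```python
from math import comb
from statistics import mean, stdev

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of
    k samples, drawn from n attempts of which c passed, solves the problem.
    comb(n - c, k) is 0 whenever k > n - c, giving pass@k = 1 in that case."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(attempts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; attempts holds (n_samples, n_correct)."""
    return mean(pass_at_k(n, c, k) for n, c in attempts)

def resolved_rate_sem(run_rates: list[float]) -> float:
    """Standard error of the mean resolved rate across independent runs."""
    return stdev(run_rates) / len(run_rates) ** 0.5

# Toy numbers: 3 problems with 5 attempts each, then 5 runs' resolved rates.
print(mean_pass_at_k([(5, 2), (5, 0), (5, 5)], k=5))           # ~0.667
print(resolved_rate_sem([0.431, 0.447, 0.439, 0.451, 0.443]))  # ~0.0034
```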

News

  • [2025-11-13]:
    • Added new model to the leaderboard: MiniMax M2.
  • [2025-10-28]:
    • Added new model to the leaderboard: GLM-4.6.
  • [2025-10-09]:
    • Added new models to the leaderboard: Claude Sonnet 4.5, gpt-5-codex, Claude Opus 4.1, Qwen3-30B-A3B-Thinking-2507 and Qwen3-30B-A3B-Instruct-2507.
    • Added a new Insights section providing analysis and key takeaways from recent model and data releases.
    • Deprecated the following models:
      • Text: Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, gemma-3-27b-it and Qwen2.5-72B-Instruct.
      • Tools: Claude Sonnet 3.5, Kimi K2, gemini-2.0-flash, Qwen3-235B-A22B and Qwen3-32B.
  • [2025-09-17]:
    • Added new models to the leaderboard: Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1 and Qwen3-Next-80B-A3B-Instruct.
  • [2025-09-04]:
    • Added new models to the leaderboard: GLM-4.5, GLM-4.5 Air, Grok Code Fast 1, Kimi K2, gpt-5-mini-2025-08-07-medium, gpt-oss-120b and gpt-oss-20b.
    • Introduced Cost per Problem and Tokens per Problem columns.
    • Added links to the pull requests within the selected time window. You can review them via the Inspect button.
    • Deprecated the following models:
      • Text: DeepSeek-V3, DeepSeek-V3-0324, Devstral-Small-2505, gemini-2.0-flash, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, Llama-4-Scout-17B-16E-Instruct and Qwen2.5-Coder-32B-Instruct.
      • Tools: horizon-alpha and horizon-beta.
  • [2025-08-12]: Added new models to the leaderboard: gpt-5-2025-08-07-medium, gpt-5-2025-08-07-high and gpt-5-2025-08-07-minimal.
  • [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct, horizon-beta.
  • [2025-07-31]:
    • Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
    • Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
    • Updated demo format: tool calls are now shown as distinct assistant and tool messages.
  • [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard.
  • [2025-07-10]: Added models performance chart and evaluations on June data.
  • [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
  • [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
  • [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.