Time window: 01/08/2025 to 01/09/2025
52 problems from 51 repositories are selected within the current time window.
You can adjust the time window by changing the start and end of the problems' release date range. Evaluations highlighted in red are potentially contaminated: they include tasks that were created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model appears at the bottom of the leaderboard with N/A values.
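The two rules above (red highlighting and N/A placement) can be sketched in a few lines. This is only an illustration of the rules as stated, not the leaderboard's actual code; the function names, arguments, and example dates are hypothetical.

```python
from datetime import date


def is_potentially_contaminated(model_release: date, task_dates: list[date]) -> bool:
    """True if any evaluated task was created before the model's release date."""
    return any(task_date < model_release for task_date in task_dates)


def has_results_for_window(
    evaluated_start: date,
    evaluated_end: date,
    window_start: date,
    window_end: date,
) -> bool:
    """True if the selected window lies within the evaluated task-date range.

    When this is False (the window is wider than the evaluated range),
    the model would be listed with N/A values.
    """
    return evaluated_start <= window_start and window_end <= evaluated_end


# Hypothetical example: a model released mid-window is flagged.
tasks = [date(2025, 8, 3), date(2025, 8, 20)]
flagged = is_potentially_contaminated(date(2025, 8, 7), tasks)
```

Here `flagged` is true because one task predates the assumed release date.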
| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4 | 49.4% | 1.18% | 59.6% | $5.29 | 1,725,759 |
| 2 | gpt-5-2025-08-07-high | 46.5% | 1.41% | 55.8% | $1.38 | 1,515,864 |
| 3 | gpt-5-2025-08-07-medium | 45.4% | 1.88% | 59.6% | $0.94 | 1,167,807 |
| 4 | GLM-4.5 | 45.0% | 0.98% | 55.8% | $0.84 | 1,377,628 |
| 5 | gpt-5-mini-2025-08-07-medium | 43.1% | 2.07% | 55.8% | $0.28 | 965,664 |
| 6 | Kimi K2 Instruct 0905 | 42.3% | 1.36% | 53.8% | $0.88 | 1,449,640 |
| 7 | Grok 4 | 41.7% | 1.99% | 53.8% | $1.40 | 1,185,331 |
| 8 | Qwen3-Coder-480B-A35B-Instruct | 40.7% | 1.92% | 59.6% | $0.51 | 1,261,177 |
| 9 | gpt-4.1-2025-04-14 | 37.4% | 1.63% | 50.0% | $0.31 | 437,071 |
| 10 | Grok Code Fast 1 | 37.3% | 1.98% | 53.8% | $0.05 | 1,202,590 |
| 11 | o3-2025-04-16 | 36.5% | 1.22% | 48.1% | $1.04 | 1,368,526 |
| 12 | Qwen3-235B-A22B-Instruct-2507 | 35.8% | 0.98% | 50.0% | $0.18 | 887,059 |
| 13 | GLM-4.5 Air | 34.7% | 1.17% | 44.2% | $0.28 | 1,388,515 |
| 14 | Kimi K2 | 34.6% | 0.86% | 51.9% | $0.46 | 892,937 |
| 15 | gpt-5-2025-08-07-minimal | 33.5% | 1.56% | 46.2% | $0.38 | 535,089 |
| 16 | DeepSeek-V3.1 | 32.7% | 1.22% | 51.9% | $0.34 | 1,240,630 |
| 17 | Claude Sonnet 3.5 | 31.6% | 1.61% | 50.0% | $2.78 | 906,160 |
| 18 | Qwen3-Next-80B-A3B-Instruct | 29.7% | 1.22% | 44.2% | $0.30 | 581,592 |
| 19 | o4-mini-2025-04-16 | 29.2% | 1.28% | 44.2% | $0.88 | 1,749,686 |
| 20 | DeepSeek-R1-0528 | 29.2% | 1.65% | 46.2% | $0.36 | 382,652 |
| 21 | Qwen3-Coder-30B-A3B-Instruct | 29.2% | 1.12% | 40.4% | $0.07 | 660,609 |
| 22 | gemini-2.5-pro | 28.8% | 1.36% | 42.3% | $0.81 | 1,069,479 |
| 23 | gpt-oss-120b | 26.5% | 0.72% | 38.5% | | |
| 24 | Qwen3-235B-A22B-Thinking-2507 | 26.5% | 0.94% | 36.5% | $0.36 | 398,882 |
| 25 | gemini-2.5-flash | 25.8% | 1.56% | 42.3% | $0.15 | 1,191,967 |
| 26 | DeepSeek-V3-0324 | 24.6% | 0.72% | 34.6% | $0.16 | 318,063 |
| 27 | gpt-4.1-mini-2025-04-14 | 19.2% | 1.61% | 34.6% | $0.19 | 1,085,329 |
| 28 | Qwen3-235B-A22B | 18.5% | 0.98% | 34.6% | $0.07 | 327,505 |
| 29 | gemini-2.0-flash | 15.5% | 0.87% | 28.8% | $0.11 | 1,071,098 |
| 30 | Llama-3.3-70B-Instruct | 10.0% | 0.38% | 19.2% | $0.07 | 796,051 |
| 31 | Qwen2.5-72B-Instruct | 9.6% | 1.05% | 26.9% | $0.16 | 1,187,159 |
| 32 | gpt-oss-20b | 8.1% | 1.12% | 23.1% | | |
| 33 | Qwen3-32B | 8.1% | 2.05% | 19.2% | $0.05 | 436,355 |
| 34 | Llama-4-Maverick-17B-128E-Instruct | 6.6% | 1.15% | 23.1% | $0.05 | 278,140 |
| 35 | gemma-3-27b-it | 2.3% | 0.72% | 5.8% | $0.03 | 309,611 |
| 36 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A |
| 37 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A |
| 38 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A |
| 39 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A |
| 40 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 41 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A |
| 42 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 43 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 44 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A |
| 45 | horizon-alpha | N/A | N/A | N/A | N/A | N/A |
| 46 | horizon-beta | N/A | N/A | N/A | N/A | N/A |
| 47 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A |
| 48 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A |
| 49 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 50 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A |
| 51 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A |
| 52 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A |
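
The page does not say how the Resolved Rate, Resolved Rate SEM and Pass@5 columns are computed. One plausible reading, sketched below, assumes each model is run several times on every problem: the resolved rate is the mean of the per-run rates, the SEM is the standard error of that mean across runs, and Pass@5 is the share of problems resolved in at least one of five runs. The `summarize` function and its input layout are hypothetical, not the leaderboard's own code.

```python
import math
from statistics import mean, stdev


def summarize(runs: list[list[bool]]) -> dict[str, float]:
    """Summarize repeated evaluation runs.

    runs[r][p] is True if run r resolved problem p. Every run is assumed
    to cover the same problems in the same order, and at least two runs
    are needed for the standard error.
    """
    n_problems = len(runs[0])
    # Percentage of problems resolved in each individual run.
    per_run_rates = [100.0 * sum(run) / n_problems for run in runs]
    # Standard error of the mean across runs (sample standard deviation / sqrt(n)).
    sem = stdev(per_run_rates) / math.sqrt(len(per_run_rates))
    # A problem counts toward pass@k if any of the k runs resolved it.
    solved_any = sum(any(run[p] for run in runs) for p in range(n_problems))
    return {
        "resolved_rate": mean(per_run_rates),
        "sem": sem,
        "pass_at_k": 100.0 * solved_any / n_problems,
    }


# Toy input: 2 runs over 4 problems (not leaderboard data).
example = summarize([
    [True, True, False, False],
    [True, False, False, False],
])
```

Under this reading, Pass@5 can only take values that are multiples of 1/52 for the 52 problems in the window, which matches the table (for example, 59.6% is 31 of 52).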

News

  • [2025-09-17]:
    • Added new models to the leaderboard: Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1 and Qwen3-Next-80B-A3B-Instruct.
  • [2025-09-04]:
    • Added new models to the leaderboard: GLM-4.5, GLM-4.5 Air, Grok Code Fast 1, Kimi K2, gpt-5-mini-2025-08-07-medium, gpt-oss-120b and gpt-oss-20b.
    • Introduced Cost per Problem and Tokens per Problem columns.
    • Added links to the pull requests within the selected time window. You can review them via the Inspect button.
    • Deprecated the following models:
      • Text: DeepSeek-V3, DeepSeek-V3-0324, Devstral-Small-2505, gemini-2.0-flash, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, Llama-4-Scout-17B-16E-Instruct and Qwen2.5-Coder-32B-Instruct.
      • Tools: horizon-alpha and horizon-beta.
  • [2025-08-12]: Added new models to the leaderboard: gpt-5-medium-2025-08-07, gpt-5-high-2025-08-07 and gpt-5-minimal-2025-08-07.
  • [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct and horizon-beta.
  • [2025-07-31]:
    • Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
    • Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
    • Updated demo format: tool calls are now shown as distinct assistant and tool messages.
  • [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard.
  • [2025-07-10]: Added a model performance chart and evaluations on June data.
  • [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
  • [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
  • [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.