Time window: 01/06/2025 to 30/06/2025
41 problems from 35 repositories selected within the current time window.
You can adjust the time window by changing the problems' release start and end dates. Evaluations highlighted in red may be contaminated: they include tasks that were created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model appears at the bottom of the leaderboard with N/A values.
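
The contamination and N/A rules described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the leaderboard's actual implementation: the `Task` class, the `evaluation_status` function, the task identifiers, and all dates in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str   # hypothetical identifier
    created: date  # date the task was created


def evaluation_status(tasks, window_start, window_end,
                      model_release, evaluated_from, evaluated_to):
    """Classify one model's evaluation for the selected time window."""
    # A window wider than the range the model was evaluated on gives N/A.
    if window_start < evaluated_from or window_end > evaluated_to:
        return "n/a"
    in_window = [t for t in tasks if window_start <= t.created <= window_end]
    # Any selected task older than the model's release may have been seen
    # during training, so the evaluation is flagged.
    if any(t.created < model_release for t in in_window):
        return "potentially contaminated"
    return "clean"


# Hypothetical tasks and window, for illustration only.
tasks = [
    Task("repo-a#1", date(2025, 6, 3)),
    Task("repo-b#7", date(2025, 6, 18)),
]
window = (date(2025, 6, 1), date(2025, 6, 30))

status = evaluation_status(
    tasks, *window,
    model_release=date(2025, 6, 10),   # released after the first task
    evaluated_from=date(2025, 6, 1),
    evaluated_to=date(2025, 6, 30),
)
print(status)  # potentially contaminated
```

Under this reading, narrowing the window so that it starts on or after the model's release date would clear the flag, because no selected task would predate the model.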
| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | pass@5 (%) |
|---|---|---|---|---|
| 1 | Claude Sonnet 4 | 47.3% | 2.26% | 58.5% |
| 2 | gpt-4.1-2025-04-14 | 38.5% | 1.79% | 56.1% |
| 3 | o3-2025-04-16 | 38.0% | 0.60% | 48.8% |
| 4 | DeepSeek-V3-0324 | 36.1% | 2.49% | 53.7% |
| 5 | gpt-4.1-2025-04-14 | 35.1% | 1.24% | 48.8% |
| 6 | DeepSeek-V3-0324 | 31.7% | 1.34% | 51.2% |
| 7 | gpt-4.1-mini-2025-04-14 | 30.7% | 2.13% | 48.8% |
| 8 | Claude Sonnet 3.5 | 30.7% | 2.63% | 51.2% |
| 9 | gemini-2.5-flash-preview-05-20 no-thinking | 30.2% | 1.46% | 48.8% |
| 10 | gemini-2.5-flash-preview-05-20 no-thinking | 29.3% | 2.56% | 46.3% |
| 11 | DeepSeek-V3 | 27.3% | 2.10% | 53.7% |
| 12 | Qwen3-235B-A22B no-thinking | 21.5% | 1.79% | 36.6% |
| 13 | Llama-4-Maverick-17B-128E-Instruct | 18.0% | 2.74% | 41.5% |
| 14 | Llama-3.3-70B-Instruct | 17.6% | 1.42% | 31.7% |
| 15 | Qwen3-235B-A22B thinking | 16.6% | 1.79% | 31.7% |
| 16 | Devstral-Small-2505 | 16.1% | 1.98% | 29.3% |
| 17 | Qwen3-32B no-thinking | 15.1% | 0.91% | 26.8% |
| 18 | Qwen3-32B thinking | 13.7% | 1.83% | 26.8% |
| 19 | Qwen2.5-72B-Instruct | 13.7% | 0.98% | 26.8% |
| 20 | Llama-4-Scout-17B-16E-Instruct | 13.2% | 2.26% | 26.8% |
| 21 | gemini-2.0-flash | 9.3% | 0.91% | 26.8% |
| 22 | gemma-3-27b-it | 8.8% | 1.65% | 17.1% |
| 23 | Qwen2.5-Coder-32B-Instruct | 3.9% | 0.98% | 14.6% |
| 24 | gpt-4.1-nano-2025-04-14 | 0.5% | 0.49% | 2.4% |
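
The page does not define how the three metric columns are computed. The sketch below shows one plausible reading, stated here as an assumption: each model is run several times on every problem, Resolved Rate is the mean of the per-run resolved fractions, SEM is the standard error of that mean across runs, and pass@k is the share of problems resolved in at least one of the k runs. The pass@5 values in the table are consistent with counts out of 41 problems (for example, 58.5% ≈ 24/41). The function name `summarize` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(results):
    """Summarize a runs-by-problems matrix of booleans (True = resolved).

    Assumed definitions, not confirmed by the leaderboard:
      resolved_rate: mean over runs of the per-run resolved fraction
      sem:           sample standard deviation of per-run rates / sqrt(runs)
      pass_at_k:     fraction of problems resolved in at least one run
    """
    n_runs = len(results)
    n_problems = len(results[0])
    per_run_rates = [sum(run) / n_problems for run in results]
    solved_any = sum(
        any(run[i] for run in results) for i in range(n_problems)
    )
    return {
        "resolved_rate": mean(per_run_rates),
        "sem": stdev(per_run_rates) / sqrt(n_runs),
        "pass_at_k": solved_any / n_problems,
    }


# Toy data: 3 runs over 4 problems (the leaderboard appears to use 5 runs).
runs = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, False],
]
summary = summarize(runs)
print({name: round(value, 4) for name, value in summary.items()})
```

With these definitions, pass@k can never fall below the resolved rate, which matches every row of the table.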

News

  • [2025-07-10]: Added a model performance chart and evaluations on June data.
  • [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
  • [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
  • [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.