Time Window: 01/03/2026 to 15/05/2026
110 problems from 86 repositories selected within the current time window.
Insights
Legend: Potential contamination, External system, Beyond eval range
| # | Provider | Model | Type | Resolved Rate (%) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem | Cached Tokens |
|---|---|---|---|---|---|---|---|---|
| 1 | OpenAI | gpt-5.5-2026-04-23-xhigh | Model | 62.7% ± 0.91% | 70.0% | $2.25 | 2,120,660 | 90.0% |
| 2 | Junie | Junie | Agent | 61.6% ± 0.64% | 72.7% | $1.84 | 1,866,497 | 91.6% |
| 3 | OpenAI | Codex | Agent | 60.4% ± 1.37% | 71.8% | $1.75 | 1,898,131 | 92.5% |
| 4 | Anthropic | Claude Code | Agent | 59.6% ± 1.98% | 72.7% | $1.74 | 1,878,248 | 93.6% |
| 5 | OpenAI | gpt-5.5-2026-04-23-medium | Model | 58.9% ± 0.78% | 70.0% | $0.98 | 708,418 | 83.5% |
| 6 | Anthropic | Claude Opus 4.8-xhigh | Model | 56.5% ± 1.20% | 67.3% | $2.02 | 2,479,387 | 95.3% |
| 7 | OpenAI | gpt-5.4-2026-03-05-medium | Model | 54.9% ± 1.02% | 70.9% | $0.60 | 834,452 | 83.5% |
| 8 | Anthropic | Claude Opus 4.7-high | Model | 53.1% ± 1.45% | 66.4% | $1.32 | 1,526,135 | 94.2% |
| 9 | Cursor | Cursor | Agent | 53.0% ± 0.53% | 64.5% | $0.23 | 1,031,653 | 98.7% |
| 10 | Anthropic | Claude Sonnet 4.6 | Model | 51.3% ± 0.55% | 63.6% | $1.29 | 2,644,577 | 95.6% |
| 11 | Gemini | Gemini 3.1 Pro Preview | Model | 51.1% ± 1.20% | 66.4% | $0.75 | 1,545,445 | 80.1% |
| 12 | Z.ai | GLM-5.1 | Model | 50.7% ± 0.93% | 65.5% | $0.94 | 2,664,001 | 91.8% |
| 13 | Gemini | Gemini 3.5 Flash | Model | 49.5% ± 0.98% | 61.8% | $0.77 | 1,848,593 | 75.7% |
| 14 | Anthropic | Claude Opus 4.6-high | Model | 47.8% ± 1.37% | 60.9% | $1.53 | 1,828,649 | 93.6% |
| 15 | Kimi | Kimi K2.6 | Model | 46.5% ± 1.27% | 64.5% | $0.61 | 2,466,977 | 90.4% |
| 16 | Minimax | MiniMax M3 | Model | 45.6% ± 1.27% | 67.3% | $1.06 | 6,885,818 | 93.5% |
| 17 | Z.ai | GLM-4.7 | Model | 38.2% ± 0.86% | 59.1% | $0.39 | 2,256,182 | 86.4% |
Entries 18 to 105 are listed with N/A for all four metrics (Resolved Rate, Pass@5, Cost per Problem, Tokens per Problem):

| # | Provider | Model | Type |
|---|---|---|---|
| 18 | Anthropic | Claude Opus 4.1 | Model |
| 19 | Anthropic | Claude Opus 4.5 | Model |
| 20 | Anthropic | Claude Sonnet 3.5 | Model |
| 21 | Anthropic | Claude Sonnet 4 | Model |
| 22 | Anthropic | Claude Sonnet 4.5 | Model |
| 23 | DeepSeek | DeepSeek-R1-0528 | Model |
| 24 | DeepSeek | DeepSeek-V3 | Model |
| 25 | DeepSeek | DeepSeek-V3-0324 | Model |
| 26 | DeepSeek | DeepSeek-V3-0324 | Model |
| 27 | DeepSeek | DeepSeek-V3.1 | Model |
| 28 | DeepSeek | DeepSeek-V3.2 | Model |
| 29 | Mistral | Devstral-2-123B-Instruct-2512 | Model |
| 30 | Mistral | Devstral-Small-2-24B-Instruct-2512 | Model |
| 31 | Mistral | Devstral-Small-2505 | Model |
| 32 | Gemini | Gemini 3 Flash Preview | Model |
| 33 | Gemini | Gemini 3 Pro Preview | Model |
| 34 | Gemini | gemini-2.0-flash | Model |
| 35 | Gemini | gemini-2.0-flash | Model |
| 36 | Gemini | gemini-2.5-flash | Model |
| 37 | Gemini | gemini-2.5-flash-preview-05-20 no-thinking | Model |
| 38 | Gemini | gemini-2.5-flash-preview-05-20 no-thinking | Model |
| 39 | Gemini | gemini-2.5-pro | Model |
| 40 | Gemini | Gemma 4 31B | Model |
| 41 | Gemini | gemma-3-27b-it | Model |
| 42 | Z.ai | GLM-4.5 | Model |
| 43 | Z.ai | GLM-4.5 Air | Model |
| 44 | Z.ai | GLM-4.6 | Model |
| 45 | Z.ai | GLM-4.7 Flash | Model |
| 46 | Z.ai | GLM-5 | Model |
| 47 | Z.ai | GLM-5.1 | Model |
| 48 | OpenAI | gpt-4.1-2025-04-14 | Model |
| 49 | OpenAI | gpt-4.1-2025-04-14 | Model |
| 50 | OpenAI | gpt-4.1-mini-2025-04-14 | Model |
| 51 | OpenAI | gpt-4.1-mini-2025-04-14 | Model |
| 52 | OpenAI | gpt-4.1-nano-2025-04-14 | Model |
| 53 | OpenAI | gpt-5-2025-08-07-high | Model |
| 54 | OpenAI | gpt-5-2025-08-07-medium | Model |
| 55 | OpenAI | gpt-5-2025-08-07-minimal | Model |
| 56 | OpenAI | gpt-5-codex | Model |
| 57 | OpenAI | gpt-5-mini-2025-08-07-high | Model |
| 58 | OpenAI | gpt-5-mini-2025-08-07-medium | Model |
| 59 | OpenAI | gpt-5.1-codex | Model |
| 60 | OpenAI | gpt-5.1-codex-max | Model |
| 61 | OpenAI | gpt-5.2-2025-12-11-medium | Model |
| 62 | OpenAI | gpt-5.2-2025-12-11-xhigh | Model |
| 63 | OpenAI | gpt-5.2-codex | Model |
| 64 | OpenAI | gpt-5.3-codex | Model |
| 65 | OpenAI | gpt-5.3-codex-xhigh | Model |
| 66 | OpenAI | gpt-oss-120b | Model |
| 67 | OpenAI | gpt-oss-120b-high | Model |
| 68 | OpenAI | gpt-oss-20b | Model |
| 69 | Grok | Grok 4 | Model |
| 70 | Grok | Grok Code Fast 1 | Model |
| 71 | OpenRouter | horizon-alpha | Model |
| 72 | OpenRouter | horizon-beta | Model |
| 73 | Kimi | Kimi K2 | Model |
| 74 | Kimi | Kimi K2 Instruct 0905 | Model |
| 75 | Kimi | Kimi K2 Thinking | Model |
| 76 | Kimi | Kimi K2.5 | Model |
| 77 | Meta | Llama-3.3-70B-Instruct | Model |
| 78 | Meta | Llama-4-Maverick-17B-128E-Instruct | Model |
| 79 | Meta | Llama-4-Scout-17B-16E-Instruct | Model |
| 80 | Minimax | MiniMax M2 | Model |
| 81 | Minimax | MiniMax M2.1 | Model |
| 82 | Minimax | MiniMax M2.5 | Model |
| 83 | Minimax | MiniMax M2.7 | Model |
| 84 | OpenAI | o3-2025-04-16 | Model |
| 85 | OpenAI | o4-mini-2025-04-16 | Model |
| 86 | Qwen | Qwen2.5-72B-Instruct | Model |
| 87 | Qwen | Qwen2.5-Coder-32B-Instruct | Model |
| 88 | Qwen | Qwen3-235B-A22B | Model |
| 89 | Qwen | Qwen3-235B-A22B no-thinking | Model |
| 90 | Qwen | Qwen3-235B-A22B thinking | Model |
| 91 | Qwen | Qwen3-235B-A22B-Instruct-2507 | Model |
| 92 | Qwen | Qwen3-235B-A22B-Thinking-2507 | Model |
| 93 | Qwen | Qwen3-30B-A3B-Instruct-2507 | Model |
| 94 | Qwen | Qwen3-30B-A3B-Thinking-2507 | Model |
| 95 | Qwen | Qwen3-32B | Model |
| 96 | Qwen | Qwen3-32B no-thinking | Model |
| 97 | Qwen | Qwen3-32B thinking | Model |
| 98 | Qwen | Qwen3-Coder-30B-A3B-Instruct | Model |
| 99 | Qwen | Qwen3-Coder-480B-A35B-Instruct | Model |
| 100 | Qwen | Qwen3-Coder-Next | Model |
| 101 | Qwen | Qwen3-Next-80B-A3B-Instruct | Model |
| 102 | Qwen | Qwen3.5-27B | Model |
| 103 | Qwen | Qwen3.5-35B-A3B | Model |
| 104 | Qwen | Qwen3.5-397B-A17B | Model |
| 105 | Stepfun | Step-3.5-Flash | Model |
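
The cost and token columns can be combined with the resolved rate to compare entries on efficiency rather than raw score. The sketch below is only one way to read the numbers, not a metric the leaderboard itself reports: it assumes that Cost per Problem is averaged over all attempted problems (so dividing by the resolved fraction approximates spend per resolved problem) and that the cached percentage applies to the full Tokens per Problem figure. The `Row` type and both helper functions are hypothetical names introduced for illustration; the four rows are copied from the leaderboard above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical container, values from the table)."""
    name: str
    resolved_rate_pct: float  # Resolved Rate (%)
    cost_usd: float           # Cost per Problem ($)
    tokens: int               # Tokens per Problem
    cached_pct: float         # share of tokens reported as cached (%)


ROWS = [
    Row("gpt-5.5-2026-04-23-xhigh", 62.7, 2.25, 2_120_660, 90.0),
    Row("Junie", 61.6, 1.84, 1_866_497, 91.6),
    Row("gpt-5.5-2026-04-23-medium", 58.9, 0.98, 708_418, 83.5),
    Row("Cursor", 53.0, 0.23, 1_031_653, 98.7),
]


def cost_per_resolved(row: Row) -> float:
    """Approximate dollars spent per resolved problem.

    Assumption: cost_usd is an average over all attempted problems.
    """
    return row.cost_usd / (row.resolved_rate_pct / 100.0)


def uncached_tokens(row: Row) -> int:
    """Approximate tokens per problem that were not served from cache.

    Assumption: cached_pct is a share of the full token count.
    """
    return round(row.tokens * (1.0 - row.cached_pct / 100.0))


# Rank the sample rows from cheapest to most expensive per resolved problem.
for row in sorted(ROWS, key=cost_per_resolved):
    print(
        f"{row.name:28s} "
        f"${cost_per_resolved(row):.2f} per resolved problem, "
        f"~{uncached_tokens(row):,} uncached tokens per problem"
    )
```

Ranking by this derived figure can reorder entries relative to the Resolved Rate ranking, so it is best treated as a complementary view rather than a replacement for the leaderboard order.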

News

  • [2026-06-09]:
    • Added new models to the leaderboard: Gemini 3.5 Flash and MiniMax M3.
  • [2026-05-28]:
    • Added new model to the leaderboard: Claude Opus 4.8.
  • [2026-05-27]:
    • Added new models to the leaderboard: gpt-5.5-2026-04-23-xhigh, gpt-5.5-2026-04-23-medium, gpt-5.4-2026-03-05-medium, Claude Opus 4.7, and Kimi K2.6.
  • [2026-04-19]:
    • Re-ran Junie with Claude Opus 4.6 as the primary model.
  • [2026-04-15]:
    • Added new models to the leaderboard: GLM-5.1, Qwen3.5-27B, Cursor, Gemma 4 31B and MiniMax M2.7.
  • [2026-03-20]:
    • Added new models to the leaderboard: gpt-5.4-2026-03-05-medium, Gemini 3.1 Pro Preview, Claude Sonnet 4.6, Qwen3.5-397B-A17B, gpt-5.3-codex-xhigh, gpt-5.3-codex and Qwen3.5-35B-A3B.
    • Deprecated the following models: gpt-5.2-2025-12-11-xhigh, gpt-5.1-codex-max, gpt-5.1-codex, gpt-5-mini-2025-08-07-high, gpt-5-mini-2025-08-07-medium, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528, Qwen3-Coder-30B-A3B-Instruct, Qwen3-Next-80B-A3B-Instruct and Qwen3-30B-A3B-Instruct-2507.
  • [2026-03-09]:
    • Added reference evaluation for Junie CLI (highlighted in orange). See setup details in Insights.
  • [2026-02-13]:
    • Added new models to the leaderboard: Claude Opus 4.6, GLM-5, MiniMax M2.5, Codex, Qwen3-Coder-Next, GLM-4.7 Flash, gpt-5.2-codex.
  • [2026-01-14]:
    • Added new models to the leaderboard: gpt-5.2-2025-12-11-xhigh, gpt-5.1-codex, GLM-4.7, gpt-5-mini-2025-08-07-high, gpt-oss-120b-high, Kimi K2 Thinking.
    • Deprecated the following models: gpt-5-2025-08-07-medium, gpt-5-2025-08-07-high, Claude Sonnet 4, Claude Opus 4.1, o3-2025-04-16, gpt-5-codex, GLM-4.5, o4-mini-2025-04-16, gpt-5-2025-08-07-minimal, gpt-4.1-2025-04-14, Qwen3-235B-A22B-Thinking-2507, gpt-4.1-mini-2025-04-14, Qwen3-30B-A3B-Thinking-2507.
  • [2025-12-22]:
    • Added new model to the leaderboard: MiniMax M2.1.
  • [2025-12-17]:
    • Added new models to the leaderboard: gpt-5.1-codex-max, gpt-5.2-2025-12-11-medium, Devstral-2-123B-Instruct-2512, Devstral-Small-2-24B-Instruct-2512, DeepSeek-V3.2.
    • Added reference evaluation for Claude Code (highlighted in orange). See setup details in Insights.
    • Deprecated the following models: gemini-2.5-pro, gemini-2.5-flash, DeepSeek-V3.1.
  • [2025-12-08]:
    • Added new model to the leaderboard: Gemini 3 Pro Preview.
  • [2025-12-05]:
    • Introduced Cached Tokens column.
  • [2025-11-25]:
    • Added new model to the leaderboard: Claude Opus 4.5.
  • [2025-11-13]:
    • Added new model to the leaderboard: MiniMax M2.
  • [2025-10-28]:
    • Added new model to the leaderboard: GLM-4.6.
  • [2025-10-09]:
    • Added new models to the leaderboard: Claude Sonnet 4.5, gpt-5-codex, Claude Opus 4.1, Qwen3-30B-A3B-Thinking-2507 and Qwen3-30B-A3B-Instruct-2507.
    • Added a new Insights section providing analysis and key takeaways from recent model and data releases.
    • Deprecated the following models:
      • Text: Llama-3.3-70B-Instruct, Llama-4-Maverick-17B-128E-Instruct, gemma-3-27b-it and Qwen2.5-72B-Instruct.
      • Tools: Claude Sonnet 3.5, Kimi K2, gemini-2.0-flash, Qwen3-235B-A22B and Qwen3-32B.
  • [2025-09-17]:
    • Added new models to the leaderboard: Grok 4, Kimi K2 Instruct 0905, DeepSeek-V3.1 and Qwen3-Next-80B-A3B-Instruct.
  • [2025-09-04]:
    • Added new models to the leaderboard: GLM-4.5, GLM-4.5 Air, Grok Code Fast 1, Kimi K2, gpt-5-mini-2025-08-07-medium, gpt-oss-120b and gpt-oss-20b.
    • Introduced Cost per Problem and Tokens per Problem columns.
    • Added links to the pull requests within the selected time window. You can review them via the Inspect button.
    • Deprecated the following models:
      • Text: DeepSeek-V3, DeepSeek-V3-0324, Devstral-Small-2505, gemini-2.0-flash, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, Llama-4-Scout-17B-16E-Instruct and Qwen2.5-Coder-32B-Instruct.
      • Tools: horizon-alpha and horizon-beta.
  • [2025-08-12]: Added new models to the leaderboard: gpt-5-medium-2025-08-07, gpt-5-high-2025-08-07 and gpt-5-minimal-2025-08-07.
  • [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct, horizon-beta.
  • [2025-07-31]:
    • Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
    • Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
    • Updated demo format: tool calls are now shown as distinct assistant and tool messages.
  • [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard.
  • [2025-07-10]: Added a model performance chart and evaluations on June data.
  • [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
  • [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
  • [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.