We made some meaningful changes in the methodology. The agent configurations were as follows:

- Claude Code: `--model=opus --allowedTools="Bash,Read,Write,Edit" --permission-mode acceptEdits --output-format stream-json --verbose`. We also set `ANTHROPIC_DEFAULT_OPUS_MODEL=claude-opus-4-6` to use Opus 4.6 as the primary model.
- Codex: `--model=gpt-5.4 -c features.web_search_request=false -c model_reasoning_effort=medium --yolo --json` (and `--model=gpt-5.2-codex --yolo --json` for the gpt-5.2-codex run). We plan to add gpt-5.3-codex once API access becomes available.
- Junie: `junie --skip-update-check --output-format=json`. In our default setup, Junie used gemini-3-flash-preview, gpt-4.1-mini-2025-04-14, and gpt-4.1-2025-04-14 for summarization.
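The Claude Code invocation described above can be sketched as a command plus environment built by the harness. This is a minimal sketch: the binary name `claude` and the wrapper function are assumptions, while the flags and the environment variable come directly from the text.

```python
# Sketch: assembling the Claude Code invocation described above.
# The binary name "claude" and this wrapper are assumptions; the flags
# and the ANTHROPIC_DEFAULT_OPUS_MODEL variable come from the text.
import os

def build_claude_code_command() -> tuple[list[str], dict[str, str]]:
    cmd = [
        "claude",
        "--model=opus",
        '--allowedTools=Bash,Read,Write,Edit',
        "--permission-mode", "acceptEdits",
        "--output-format", "stream-json",
        "--verbose",
    ]
    env = dict(os.environ)
    # Route the "opus" alias to Opus 4.6, as in our setup.
    env["ANTHROPIC_DEFAULT_OPUS_MODEL"] = "claude-opus-4-6"
    return cmd, env

cmd, env = build_claude_code_command()
```

The harness would pass `cmd` and `env` to its process launcher; the other agents' invocations are assembled the same way from their respective flags.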

Claude Code was launched with `--model=opus --allowedTools="Bash,Read" --permission-mode acceptEdits --output-format stream-json --verbose`. This results in a mixed execution pattern in which Opus 4.5 handles core reasoning and auxiliary tasks are delegated to Haiku 4.5. Across trajectories, ~30% of steps originate from Haiku, with the remaining majority from Opus 4.5. We use version 2.0.62 of Claude Code. In rare instances (1–2 out of 47 tasks), Claude Code attempts to use prohibited tools such as WebFetch or to request user approval, resulting in timeouts and task failure.
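The per-model step share quoted above can be measured from the stream-json output, which emits one JSON object per line. A minimal sketch, with the caveat that the exact field layout (`"type": "assistant"` events carrying `message["model"]`) is an assumption about the stream format, demonstrated here on synthetic lines:

```python
# Sketch: estimating what fraction of steps each model contributes from
# stream-json output (one JSON object per line). The field layout
# ("type" == "assistant", message["model"]) is an assumed schema.
import json
from collections import Counter

def model_step_share(stream_lines):
    counts = Counter()
    for line in stream_lines:
        event = json.loads(line)
        if event.get("type") == "assistant":
            counts[event["message"]["model"]] += 1
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

# Synthetic trajectory: 7 Opus steps and 3 Haiku steps -> 30% Haiku.
lines = (
    [json.dumps({"type": "assistant", "message": {"model": "claude-opus-4-5"}})] * 7
    + [json.dumps({"type": "assistant", "message": {"model": "claude-haiku-4-5"}})] * 3
)
share = model_step_share(lines)
```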



| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem | Cached Tokens (%) |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 65.3% | 0.66% | 70.2% | $1.12 | 1,296,667 | 92.9% |
| 2 | gpt-5.2-2025-12-11-medium | 64.4% | 1.85% | 73.7% | $0.62 | 1,009,563 | 78.3% |
| 3 | GLM-5 | 62.8% | 0.86% | 70.2% | $0.76 | 2,218,525 | 78.7% |
| 4 | gpt-5.4-2026-03-05-medium | 62.8% | 1.79% | 70.2% | $0.63 | 774,607 | 77.6% |
| 5 | Gemini 3.1 Pro Preview | 62.3% | 0.76% | 75.4% | $0.66 | 1,420,386 | 81.6% |
| 6 | DeepSeek-V3.2 | 60.9% | 1.45% | 73.7% | $0.75 | 3,056,040 | 15.7% |
| 7 | Claude Sonnet 4.6 | 60.7% | 0.89% | 70.2% | $1.02 | 2,086,974 | 95.3% |
| 8 | Claude Sonnet 4.5 | 60.0% | 0.71% | 69.6% | $1.18 | 2,408,578 | 96.3% |
| 9 | Qwen3.5-397B-A17B | 59.9% | 1.64% | 71.9% | $1.18 | 2,957,222 | 91.4% |
| 10 | Step-3.5-Flash | 59.6% | 0.54% | 71.9% | $0.14 | 3,745,323 | 81.9% |
| 11 | Junie | 59.5% | 1.11% | 75.4% | $0.31 | 1,782,125 | 77.2% |
| 12 | gpt-5.3-codex-xhigh | 58.6% | 0.89% | 70.2% | $0.89 | 1,771,028 | 84.4% |
| 13 | Kimi K2.5 | 58.5% | 0.66% | 70.2% | $0.21 | 1,536,041 | 93.9% |
| 14 | Claude Code | 58.4% | 1.15% | 68.4% | $4.91 | 3,063,081 | 92.3% |
| 15 | Codex | 58.3% | 0.75% | 64.9% | $0.61 | 1,450,760 | 95.0% |
| 16 | gpt-5.3-codex | 58.2% | 1.34% | 62.5% | $0.40 | 680,434 | 79.1% |
| 17 | Kimi K2 Thinking | 57.4% | 2.76% | 71.9% | $0.62 | 3,411,000 | 95.7% |
| 18 | gpt-5.2-codex | 56.8% | 1.54% | 64.3% | $0.42 | 732,529 | 74.9% |
| 19 | MiniMax M2.5 | 54.6% | 2.05% | 63.2% | $0.13 | 2,325,201 | 90.4% |
| 20 | Qwen3-Coder-Next | 54.4% | 0.62% | 63.2% | $1.67 | 8,137,372 | 98.5% |
| 21 | Qwen3.5-35B-A3B | 53.7% | 2.52% | 63.2% | $0.81 | 4,753,719 | 94.3% |
| 22 | Gemini 3 Flash Preview | 52.5% | 1.06% | 68.4% | $0.26 | 1,945,862 | 75.9% |
| 23 | Devstral-2-123B-Instruct-2512 | 48.8% | 2.31% | 59.6% | $0.09 | 1,677,742 | 97.3% |
| 24 | Qwen3-Coder-480B-A35B-Instruct | 44.7% | 0.63% | 56.1% | $0.45 | 2,231,580 | 29.0% |
| 25 | Devstral-Small-2-24B-Instruct-2512 | 38.9% | 2.14% | 63.2% | $0.13 | 2,075,167 | 97.3% |
| 26 | GLM-4.5 Air | 38.3% | 1.41% | 57.9% | $0.08 | 1,771,630 | 93.9% |
| 27 | GLM-4.7 Flash | 34.0% | 2.19% | 56.1% | $0.08 | 4,556,686 | 90.3% |
| 28 | gpt-oss-120b | 33.3% | 1.13% | 47.4% | $0.19 | 1,271,158 | 94.3% |
| 29 | Claude Opus 4.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 30 | Claude Opus 4.5 | N/A | N/A | N/A | N/A | N/A | N/A |
| 31 | Claude Sonnet 3.5 | N/A | N/A | N/A | N/A | N/A | N/A |
| 32 | Claude Sonnet 4 | N/A | N/A | N/A | N/A | N/A | N/A |
| 33 | DeepSeek-R1-0528 | N/A | N/A | N/A | N/A | N/A | N/A |
| 34 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A | N/A |
| 35 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A | N/A |
| 36 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A | N/A |
| 37 | DeepSeek-V3.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 38 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A | N/A |
| 39 | Gemini 3 Pro Preview | N/A | N/A | N/A | N/A | N/A | N/A |
| 40 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A | N/A |
| 41 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A | N/A |
| 42 | gemini-2.5-flash | N/A | N/A | N/A | N/A | N/A | N/A |
| 43 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 44 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 45 | gemini-2.5-pro | N/A | N/A | N/A | N/A | N/A | N/A |
| 46 | gemma-3-27b-it | N/A | N/A | N/A | N/A | N/A | N/A |
| 47 | GLM-4.5 | N/A | N/A | N/A | N/A | N/A | N/A |
| 48 | GLM-4.6 | N/A | N/A | N/A | N/A | N/A | N/A |
| 49 | GLM-4.7 | N/A | N/A | N/A | N/A | N/A | N/A |
| 50 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 51 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 52 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 53 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 54 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 55 | gpt-5-2025-08-07-high | N/A | N/A | N/A | N/A | N/A | N/A |
| 56 | gpt-5-2025-08-07-medium | N/A | N/A | N/A | N/A | N/A | N/A |
| 57 | gpt-5-2025-08-07-minimal | N/A | N/A | N/A | N/A | N/A | N/A |
| 58 | gpt-5-codex | N/A | N/A | N/A | N/A | N/A | N/A |
| 59 | gpt-5-mini-2025-08-07-high | N/A | N/A | N/A | N/A | N/A | N/A |
| 60 | gpt-5-mini-2025-08-07-medium | N/A | N/A | N/A | N/A | N/A | N/A |
| 61 | gpt-5.1-codex | N/A | N/A | N/A | N/A | N/A | N/A |
| 62 | gpt-5.1-codex-max | N/A | N/A | N/A | N/A | N/A | N/A |
| 63 | gpt-5.2-2025-12-11-xhigh | N/A | N/A | N/A | N/A | N/A | N/A |
| 64 | gpt-oss-120b-high | N/A | N/A | N/A | N/A | N/A | N/A |
| 65 | gpt-oss-20b | N/A | N/A | N/A | N/A | N/A | N/A |
| 66 | Grok 4 | N/A | N/A | N/A | N/A | N/A | N/A |
| 67 | Grok Code Fast 1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 68 | horizon-alpha | N/A | N/A | N/A | N/A | N/A | N/A |
| 69 | horizon-beta | N/A | N/A | N/A | N/A | N/A | N/A |
| 70 | Kimi K2 | N/A | N/A | N/A | N/A | N/A | N/A |
| 71 | Kimi K2 Instruct 0905 | N/A | N/A | N/A | N/A | N/A | N/A |
| 72 | Llama-3.3-70B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 73 | Llama-4-Maverick-17B-128E-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 74 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 75 | MiniMax M2 | N/A | N/A | N/A | N/A | N/A | N/A |
| 76 | MiniMax M2.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 77 | o3-2025-04-16 | N/A | N/A | N/A | N/A | N/A | N/A |
| 78 | o4-mini-2025-04-16 | N/A | N/A | N/A | N/A | N/A | N/A |
| 79 | Qwen2.5-72B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 80 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 81 | Qwen3-235B-A22B | N/A | N/A | N/A | N/A | N/A | N/A |
| 82 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 83 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 84 | Qwen3-235B-A22B-Instruct-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 85 | Qwen3-235B-A22B-Thinking-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 86 | Qwen3-30B-A3B-Instruct-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 87 | Qwen3-30B-A3B-Thinking-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 88 | Qwen3-32B | N/A | N/A | N/A | N/A | N/A | N/A |
| 89 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 90 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 91 | Qwen3-Coder-30B-A3B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 92 | Qwen3-Next-80B-A3B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
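The uncertainty and Pass@5 columns in the table can be derived from per-run outcomes. A minimal sketch, assuming several independent runs per model and the standard unbiased pass@k estimator (the exact aggregation used for the table is not spelled out here):

```python
# Sketch: deriving the Resolved Rate SEM and Pass@5 columns from per-run
# results. Assumes independent runs; pass@k uses the standard unbiased
# estimator 1 - C(n-c, k) / C(n, k).
from math import comb, sqrt
from statistics import stdev

def resolved_rate_sem(per_run_rates):
    """SEM of the resolved rate across independent runs."""
    return stdev(per_run_rates) / sqrt(len(per_run_rates))

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them successful."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: five runs of one model; per-run resolved rates over the task set.
rates = [0.63, 0.65, 0.66, 0.64, 0.67]
sem = resolved_rate_sem(rates)

# Example: a problem solved in 2 of 10 attempts, evaluated at k=5.
p5 = pass_at_k(n=10, c=2, k=5)
```

Per-problem pass@5 values are then averaged over the task set to produce the table's Pass@5 column.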
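The Cached Tokens column helps explain the Cost per Problem column: cache reads are billed at a steep discount, so a high cached share lowers the blended cost. A minimal sketch of that relationship; the per-million-token price and the 90% cache-read discount below are purely illustrative assumptions, not any provider's actual rates:

```python
# Sketch: why a high cached-token share lowers cost per problem.
# price_per_mtok and cache_discount are illustrative assumptions,
# not actual provider pricing; real discounts vary by provider.
def cost_per_problem(total_tokens, cached_frac,
                     price_per_mtok=2.0, cache_discount=0.9):
    """Blended input cost: cached tokens billed at a deep discount."""
    cached = total_tokens * cached_frac
    fresh = total_tokens - cached
    per_tok = price_per_mtok / 1e6
    return fresh * per_tok + cached * per_tok * (1 - cache_discount)

# Example: ~1.3M tokens per problem at a 92.9% cached share,
# compared with the same volume and no caching.
with_cache = cost_per_problem(1_300_000, 0.929)
without_cache = cost_per_problem(1_300_000, 0.0)
```

Under these assumed rates, caching cuts the input cost by a large factor, which is why models with multi-million-token trajectories can still land near the bottom of the cost column.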