Junie: `junie --skip-update-check --output-format=json`. In our default setup, Junie used gemini-3-flash-preview, gpt-4.1-mini-2025-04-14, and gpt-4.1-2025-04-14 for summarization.

Claude Code: `--model=opus --allowedTools="Bash,Read,Write,Edit" --permission-mode acceptEdits --output-format stream-json --verbose`. We also set `ANTHROPIC_DEFAULT_OPUS_MODEL=claude-opus-4-6` to use Opus 4.6 as the primary model.

Codex: `--model=gpt-5.2-codex --yolo --json`. We plan to add gpt-5.3-codex once API access becomes available.
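Assembled into full invocations, the configurations above might look like the following. This is a sketch, not our exact harness: the `claude` and `codex` binary names, the `$TASK_PROMPT` placeholder, and the prompt plumbing are assumptions; only the flags come from the text above.

```shell
# Junie: skip self-update, emit JSON output.
junie --skip-update-check --output-format=json

# Claude Code: pin the primary model via env var, restrict the tool set,
# auto-accept edits, and stream JSON logs.
export ANTHROPIC_DEFAULT_OPUS_MODEL=claude-opus-4-6
claude -p "$TASK_PROMPT" \
  --model=opus \
  --allowedTools="Bash,Read,Write,Edit" \
  --permission-mode acceptEdits \
  --output-format stream-json \
  --verbose

# Codex: full-auto mode with JSON output.
codex exec --model=gpt-5.2-codex --yolo --json "$TASK_PROMPT"
```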

`--model=opus --allowedTools="Bash,Read" --permission-mode acceptEdits --output-format stream-json --verbose`. This results in a mixed execution pattern in which Opus 4.5 handles core reasoning and auxiliary tasks are delegated to Haiku 4.5. Across trajectories, ~30% of steps originate from Haiku, with the remaining majority from Opus 4.5. We use version 2.0.62 of Claude Code. In rare instances (1–2 out of 47 tasks), Claude Code attempts to use prohibited tools such as WebFetch, or requests user approval, which results in timeouts and task failure.
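The per-model step fractions above can be recovered from the `stream-json` trajectory logs. A minimal sketch, assuming each log line is a JSON event with `type == "assistant"` and the model name under `message.model` (the exact event schema is an assumption, not part of the text above):

```python
import json
from collections import Counter

def model_step_fractions(jsonl_lines):
    """Fraction of assistant steps produced by each model in a
    stream-json trajectory (event shape assumed, see lead-in)."""
    counts = Counter()
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("type") == "assistant":
            model = event.get("message", {}).get("model", "unknown")
            counts[model] += 1
    total = sum(counts.values()) or 1
    return {m: n / total for m, n in counts.items()}

# Synthetic trajectory: 7 Opus steps, 3 Haiku steps.
lines = (
    ['{"type": "assistant", "message": {"model": "claude-opus-4-5"}}'] * 7
    + ['{"type": "assistant", "message": {"model": "claude-haiku-4-5"}}'] * 3
)
fractions = model_step_fractions(lines)
# → {'claude-opus-4-5': 0.7, 'claude-haiku-4-5': 0.3}
```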



| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem | Cached Tokens (%) |
|---|---|---|---|---|---|---|---|
| 1 | Claude Code | 52.9% | 1.06% | 70.8% | $3.50 | 2,088,226 | 92.1% |
| 2 | Junie | 52.1% | 1.14% | 62.5% | $0.27 | 1,489,870 | 78.4% |
| 3 | Claude Opus 4.6 | 51.7% | 0.42% | 58.3% | $0.93 | 1,031,373 | 94.3% |
| 4 | gpt-5.2-2025-12-11-xhigh | 51.7% | 1.21% | 58.3% | $1.28 | 1,995,966 | 81.5% |
| 5 | gpt-5.2-2025-12-11-medium | 51.0% | 1.04% | 60.4% | $0.76 | 981,139 | 68.3% |
| 6 | gpt-5.1-codex-max | 48.5% | 1.13% | 56.3% | $0.73 | 1,239,950 | 67.1% |
| 7 | Claude Sonnet 4.5 | 47.1% | 1.69% | 60.4% | $0.94 | 1,924,648 | 96.4% |
| 8 | Gemini 3 Pro Preview | 46.7% | 2.04% | 58.3% | $0.59 | 1,221,222 | 84.6% |
| 9 | Gemini 3 Flash Preview | 46.7% | 1.41% | 54.2% | $0.32 | 2,173,478 | 77.5% |
| 10 | gpt-5.2-codex | 45.0% | 1.69% | 54.2% | $0.46 | 579,616 | 66.1% |
| 11 | Codex | 44.0% | 2.46% | 55.3% | $0.29 | 580,361 | 86.9% |
| 12 | Claude Opus 4.5 | 43.8% | 0.93% | 58.3% | $1.19 | 1,426,974 | 95.3% |
| 13 | Kimi K2 Thinking | 43.8% | 1.47% | 58.3% | $0.42 | 2,242,684 | 95.1% |
| 14 | gpt-5.1-codex | 42.9% | 1.25% | 50.0% | $0.64 | 1,790,759 | 84.2% |
| 15 | GLM-5 | 42.1% | 1.21% | 50.0% | $0.45 | 1,426,726 | 84.1% |
| 16 | GLM-4.7 | 41.3% | 2.12% | 56.3% | $0.27 | 1,866,019 | 94.1% |
| 17 | Qwen3-Coder-Next | 40.0% | 1.21% | 64.6% | $0.49 | 2,341,400 | 97.6% |
| 18 | MiniMax M2.5 | 39.6% | 0.66% | 56.3% | $0.09 | 1,391,598 | 89.5% |
| 19 | Kimi K2.5 | 37.9% | 1.21% | 50.0% | $0.18 | 1,156,152 | 90.2% |
| 20 | Devstral-2-123B-Instruct-2512 | 37.5% | 2.19% | 52.1% | $0.09 | 1,743,224 | 96.6% |
| 21 | DeepSeek-V3.2 | 37.5% | 1.14% | 45.8% | $0.15 | 2,120,848 | 85.1% |
| 22 | GLM-4.6 | 37.1% | 1.02% | 54.2% | $0.23 | 1,684,257 | 95.8% |
| 23 | gpt-5-mini-2025-08-07-high | 35.0% | 1.21% | 54.2% | $0.70 | 1,624,824 | 77.0% |
| 24 | Kimi K2 Instruct 0905 | 34.3% | 0.98% | 43.8% | $0.33 | 1,798,982 | 93.2% |
| 25 | Devstral-Small-2-24B-Instruct-2512 | 32.1% | 1.41% | 47.9% | $0.12 | 2,017,978 | 97.0% |
| 26 | GLM-4.5 Air | 31.8% | 1.83% | 45.8% | $0.07 | 1,375,045 | 94.1% |
| 27 | MiniMax M2.1 | 31.7% | 2.12% | 47.9% | $0.11 | 1,681,919 | 89.3% |
| 28 | Qwen3-Coder-480B-A35B-Instruct | 31.7% | 1.79% | 41.7% | $0.33 | 1,642,089 | 96.4% |
| 29 | gpt-5-mini-2025-08-07-medium | 30.8% | 1.02% | 41.7% | $0.31 | 1,029,978 | 64.0% |
| 30 | GLM-4.7 Flash | 25.4% | 1.21% | 41.7% | $0.05 | 1,948,975 | 74.3% |
| 31 | gpt-oss-120b | 24.6% | 0.78% | 35.4% | $0.17 | 1,086,528 | 94.1% |
| 32 | Qwen3-235B-A22B-Instruct-2507 | 23.8% | 0.51% | 33.3% | $0.22 | 1,077,744 | 94.0% |
| 33 | DeepSeek-R1-0528 | 21.7% | 1.25% | 39.6% | $0.41 | 431,985 | 0.0% |
| 34 | Qwen3-Coder-30B-A3B-Instruct | 18.0% | 0.47% | 29.2% | $0.05 | 749,054 | 95.1% |
| 35 | Qwen3-Next-80B-A3B-Instruct | 15.4% | 0.83% | 25.0% | $0.12 | 730,412 | 90.4% |
| 36 | Qwen3-30B-A3B-Instruct-2507 | 7.1% | 1.41% | 16.7% | $0.12 | 1,225,164 | 93.0% |
| 37 | Claude Opus 4.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 38 | Claude Sonnet 3.5 | N/A | N/A | N/A | N/A | N/A | N/A |
| 39 | Claude Sonnet 4 | N/A | N/A | N/A | N/A | N/A | N/A |
| 40 | DeepSeek-V3 | N/A | N/A | N/A | N/A | N/A | N/A |
| 41 | DeepSeek-V3-0324 | N/A | N/A | N/A | N/A | N/A | N/A |
| 42 | DeepSeek-V3.1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 43 | Devstral-Small-2505 | N/A | N/A | N/A | N/A | N/A | N/A |
| 44 | gemini-2.0-flash | N/A | N/A | N/A | N/A | N/A | N/A |
| 45 | gemini-2.5-flash | N/A | N/A | N/A | N/A | N/A | N/A |
| 46 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 47 | gemini-2.5-pro | N/A | N/A | N/A | N/A | N/A | N/A |
| 48 | gemma-3-27b-it | N/A | N/A | N/A | N/A | N/A | N/A |
| 49 | GLM-4.5 | N/A | N/A | N/A | N/A | N/A | N/A |
| 50 | gpt-4.1-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 51 | gpt-4.1-mini-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 52 | gpt-4.1-nano-2025-04-14 | N/A | N/A | N/A | N/A | N/A | N/A |
| 53 | gpt-5-2025-08-07-high | N/A | N/A | N/A | N/A | N/A | N/A |
| 54 | gpt-5-2025-08-07-medium | N/A | N/A | N/A | N/A | N/A | N/A |
| 55 | gpt-5-2025-08-07-minimal | N/A | N/A | N/A | N/A | N/A | N/A |
| 56 | gpt-5-codex | N/A | N/A | N/A | N/A | N/A | N/A |
| 57 | gpt-oss-120b-high | N/A | N/A | N/A | N/A | N/A | N/A |
| 58 | gpt-oss-20b | N/A | N/A | N/A | N/A | N/A | N/A |
| 59 | Grok 4 | N/A | N/A | N/A | N/A | N/A | N/A |
| 60 | Grok Code Fast 1 | N/A | N/A | N/A | N/A | N/A | N/A |
| 61 | horizon-alpha | N/A | N/A | N/A | N/A | N/A | N/A |
| 62 | horizon-beta | N/A | N/A | N/A | N/A | N/A | N/A |
| 63 | Kimi K2 | N/A | N/A | N/A | N/A | N/A | N/A |
| 64 | Llama-3.3-70B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 65 | Llama-4-Maverick-17B-128E-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 66 | Llama-4-Scout-17B-16E-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 67 | MiniMax M2 | N/A | N/A | N/A | N/A | N/A | N/A |
| 68 | o3-2025-04-16 | N/A | N/A | N/A | N/A | N/A | N/A |
| 69 | o4-mini-2025-04-16 | N/A | N/A | N/A | N/A | N/A | N/A |
| 70 | Qwen2.5-72B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 71 | Qwen2.5-Coder-32B-Instruct | N/A | N/A | N/A | N/A | N/A | N/A |
| 72 | Qwen3-235B-A22B | N/A | N/A | N/A | N/A | N/A | N/A |
| 73 | Qwen3-235B-A22B no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 74 | Qwen3-235B-A22B thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 75 | Qwen3-235B-A22B-Thinking-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 76 | Qwen3-30B-A3B-Thinking-2507 | N/A | N/A | N/A | N/A | N/A | N/A |
| 77 | Qwen3-32B | N/A | N/A | N/A | N/A | N/A | N/A |
| 78 | Qwen3-32B no-thinking | N/A | N/A | N/A | N/A | N/A | N/A |
| 79 | Qwen3-32B thinking | N/A | N/A | N/A | N/A | N/A | N/A |
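One way to read the cost column: dividing cost per problem by resolved rate gives an approximate cost per *resolved* task, which reorders the leaders considerably. A minimal sketch using values copied from a few rows of the table above:

```python
# Approximate cost per resolved task = cost per problem / resolved rate,
# using values from three rows of the table above.
rows = {
    "Claude Code":  {"resolved": 0.529, "cost": 3.50},
    "Junie":        {"resolved": 0.521, "cost": 0.27},
    "MiniMax M2.5": {"resolved": 0.396, "cost": 0.09},
}

cost_per_resolved = {
    name: r["cost"] / r["resolved"] for name, r in rows.items()
}

# Cheapest per resolved task first.
for name, c in sorted(cost_per_resolved.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${c:.2f} per resolved task")
# MiniMax M2.5: $0.23, Junie: $0.52, Claude Code: $6.62
```

By this metric MiniMax M2.5's low per-problem cost outweighs its lower resolved rate, while Claude Code's top resolved rate comes at roughly an order of magnitude higher cost per success.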