21 problems from 17 repositories selected within the current time window.
You can adjust the time window by modifying the problems' release start and end dates. Evaluations highlighted in red may be contaminated, meaning they include tasks that were created before the model's release date.
Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | pass@5 (%) |
---|---|---|---|---|
1 | gpt-4.1-2025-04-14 | 16.2% | 1.17% | 23.8% |
2 | DeepSeek-V3-0324 | 13.3% | 3.16% | 23.8% |
3 | DeepSeek-V3 | 11.4% | 1.90% | 14.3% |
4 | Qwen3-235B-A22B no-thinking | 10.5% | 1.78% | 14.3% |
5 | Qwen3-32B no-thinking | 10.5% | 0.95% | 14.3% |
6 | Qwen3-32B thinking | 9.5% | 0.00% | 9.5% |
7 | Qwen3-235B-A22B thinking | 8.6% | 1.76% | 19.0% |
8 | Llama-4-Maverick-17B-128E-Instruct | 7.6% | 1.90% | 14.3% |
9 | Llama-3.3-70B-Instruct | 7.6% | 1.17% | 14.3% |
10 | Llama-4-Scout-17B-16E-Instruct | 4.8% | 2.13% | 14.3% |
11 | Qwen2.5-72B-Instruct | 3.8% | 0.95% | 4.8% |
12 | gemma-3-27b-it | 3.8% | 0.95% | 9.5% |
13 | Qwen2.5-Coder-32B-Instruct | 1.0% | 0.95% | 4.8% |
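
The page does not define the table's columns, so the following is only a sketch of how such figures are commonly computed, under stated assumptions: each model is run 5 times on the same 21 problems, Resolved Rate is the mean of the per-run resolved percentages, SEM is the sample standard deviation of those per-run percentages divided by the square root of the number of runs, and pass@5 is the share of problems resolved in at least one of the 5 runs. The function name `summarize` and the per-run outcomes are hypothetical illustrations, not actual evaluation data.

```python
import math
from statistics import mean, stdev


def summarize(runs):
    """Summarize repeated evaluation runs (assumed definitions, see lead-in).

    runs: list of runs; each run is a list of booleans, one per problem,
          True when the problem was resolved in that run.
    Returns (resolved_rate_pct, sem_pct, pass_at_k_pct), where k = len(runs).
    """
    n_problems = len(runs[0])
    # Resolved percentage for each individual run.
    per_run = [100.0 * sum(run) / n_problems for run in runs]
    resolved_rate = mean(per_run)
    # Standard error of the mean across runs (sample standard deviation).
    sem = stdev(per_run) / math.sqrt(len(runs)) if len(runs) > 1 else 0.0
    # A problem counts toward pass@k if any run resolved it.
    solved_any = sum(any(outcomes) for outcomes in zip(*runs))
    pass_at_k = 100.0 * solved_any / n_problems
    return resolved_rate, sem, pass_at_k


# Hypothetical outcomes: 21 problems, 5 runs.
# Four runs resolve problems 0 and 1; one run also resolves problem 2.
base = [i < 2 for i in range(21)]
extra = [i < 3 for i in range(21)]
rate, sem, p5 = summarize([base, base, base, base, extra])
print(f"{rate:.1f} {sem:.2f} {p5:.1f}")  # prints "10.5 0.95 14.3"
```

Under these assumptions, an SEM of 0.00% means every run resolved the same number of problems, and a pass@5 well above the Resolved Rate means different runs resolved different problems.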