34 problems from 28 repositories selected within the current time window.
You can adjust the time window by modifying the problems' release start and end dates. Evaluations highlighted in red may be contaminated, meaning they include tasks created before the model's release date. If the selected time window is wider than the range of task dates a model was evaluated on, the model appears at the bottom of the leaderboard with N/A values.
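As a rough illustration of the contamination rule, the check below sketches the date comparison described above; it is an assumption about how the red highlighting is derived, not the leaderboard's actual implementation.

```python
from datetime import date

def is_potentially_contaminated(task_created: date, model_released: date) -> bool:
    # A task created before the model's release date could have appeared in the
    # model's training data, so the evaluation is flagged (highlighted in red).
    return task_created < model_released

# Hypothetical dates, for illustration only.
print(is_potentially_contaminated(date(2025, 3, 1), date(2025, 4, 14)))  # True
```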
Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | pass@5 (%) |
---|---|---|---|---|
1 | gpt-5-2025-08-07-medium | 29.4% | 2.08% | 38.2% |
2 | gpt-5-2025-08-07-high | 26.5% | 0.93% | 32.4% |
3 | Qwen3-Coder-480B-A35B-Instruct | 22.4% | 0.72% | 32.4% |
4 | Claude Sonnet 4 | 20.6% | 0.93% | 23.5% |
5 | o3-2025-04-16 | 20.6% | 0.93% | 26.5% |
6 | gpt-5-2025-08-07-minimal | 20.0% | 2.35% | 35.3% |
7 | o4-mini-2025-04-16 | 18.2% | 1.10% | 26.5% |
8 | gpt-4.1-2025-04-14 | 17.6% | 1.32% | 23.5% |
9 | Qwen3-235B-A22B-Instruct-2507 | 17.1% | 0.59% | 20.6% |
10 | horizon-beta | 17.1% | 0.59% | 20.6% |
11 | gemini-2.5-pro | 16.5% | 1.18% | 20.6% |
12 | Claude Sonnet 3.5 | 15.9% | 0.72% | 20.6% |
13 | horizon-alpha | 15.3% | 1.10% | 20.6% |
14 | gpt-4.1-mini-2025-04-14 | 15.3% | 1.10% | 17.6% |
15 | Qwen3-235B-A22B-Thinking-2507 | 15.3% | 1.44% | 17.6% |
16 | DeepSeek-R1-0528 | 15.3% | 1.10% | 17.6% |
17 | gpt-4.1-2025-04-14 | 14.7% | 0.93% | 20.6% |
18 | Qwen3-235B-A22B no-thinking | 14.1% | 1.10% | 23.5% |
19 | gemini-2.5-flash | 14.1% | 1.44% | 23.5% |
20 | DeepSeek-V3 | 14.1% | 0.59% | 17.6% |
21 | Qwen3-Coder-30B-A3B-Instruct | 14.1% | 1.10% | 17.6% |
22 | DeepSeek-V3-0324 | 12.9% | 1.18% | 20.6% |
23 | DeepSeek-V3-0324 | 12.4% | 1.10% | 20.6% |
24 | Qwen3-235B-A22B thinking | 10.0% | 1.50% | 14.7% |
25 | gemini-2.0-flash | 10.0% | 1.50% | 14.7% |
26 | Qwen3-32B no-thinking | 9.4% | 0.59% | 14.7% |
27 | Llama-3.3-70B-Instruct | 9.4% | 0.59% | 14.7% |
28 | Qwen3-32B thinking | 8.2% | 1.71% | 17.6% |
29 | Devstral-Small-2505 | 8.2% | 1.71% | 11.8% |
30 | Llama-4-Maverick-17B-128E-Instruct | 7.1% | 2.20% | 20.6% |
31 | gemini-2.0-flash | 7.1% | 1.76% | 17.6% |
32 | Qwen2.5-72B-Instruct | 5.9% | 1.32% | 11.8% |
33 | Llama-4-Scout-17B-16E-Instruct | 5.3% | 1.10% | 14.7% |
34 | gemma-3-27b-it | 5.3% | 1.71% | 8.8% |
35 | Qwen2.5-Coder-32B-Instruct | 0.6% | 0.59% | 2.9% |
36 | gpt-4.1-nano-2025-04-14 | 0.0% | 0.00% | 0.0% |
37 | gemini-2.5-flash-preview-05-20 no-thinking | N/A | N/A | N/A |
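For readers who want to reproduce the metrics, the sketch below shows how Resolved Rate, its SEM, and pass@5 could be computed from per-problem attempt results. It assumes the standard unbiased pass@k estimator and that the SEM is taken over repeated evaluation runs; the page does not spell out these definitions, so treat both as assumptions.

```python
from math import comb, sqrt
from statistics import mean, stdev

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts,
    drawn from n total attempts of which c succeeded, resolves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def resolved_rate(run: list[bool]) -> float:
    """Fraction of problems resolved in a single evaluation run."""
    return mean(run)

def resolved_rate_sem(runs: list[list[bool]]) -> float:
    """Standard error of the mean resolved rate across repeated runs."""
    rates = [resolved_rate(run) for run in runs]
    return stdev(rates) / sqrt(len(rates))

# Illustrative numbers only: 5 attempts per problem, per-problem success counts.
successes = [3, 0, 1, 5, 0]
print(mean(pass_at_k(5, c, k=5) for c in successes))  # pass@5 over these problems
```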
News
- [2025-08-12]: Added new models to the leaderboard: gpt-5-medium-2025-08-07, gpt-5-high-2025-08-07 and gpt-5-minimal-2025-08-07.
- [2025-08-02]: Added new models to the leaderboard: Qwen3-Coder-30B-A3B-Instruct and horizon-beta.
- [2025-07-31]:
- Added new models to the leaderboard: gemini-2.5-pro, gemini-2.5-flash, o4-mini-2025-04-16, Qwen3-Coder-480B-A35B-Instruct, Qwen3-235B-A22B-Thinking-2507, Qwen3-235B-A22B-Instruct-2507, DeepSeek-R1-0528 and horizon-alpha.
- Deprecated models: gemini-2.5-flash-preview-05-20 no-thinking.
- Updated demo format: tool calls are now shown as distinct assistant and tool messages.
- [2025-07-11]: Released Docker images for all leaderboard problems and published a dedicated HuggingFace dataset containing only the problems used in the leaderboard (a loading sketch follows this list).
- [2025-07-10]: Added a model performance chart and evaluations on June data.
- [2025-06-12]: Added tool usage support, evaluations on May data and new models: Claude Sonnet 3.5/4 and o3.
- [2025-05-22]: Added Devstral-Small-2505 to the leaderboard.
- [2025-05-21]: Added new models to the leaderboard: gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gemini-2.0-flash and gemini-2.5-flash-preview-05-20.
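The dedicated HuggingFace dataset mentioned in the 2025-07-11 update can be loaded with the `datasets` library; the repository id below is a placeholder, not the dataset's actual name.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the dataset linked from the leaderboard.
ds = load_dataset("your-org/leaderboard-problems")
print(ds)  # shows the available splits, row counts, and column names
```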