AARRI-Bench
Leaderboard
Agent Overall Performance by Task Category. Best results are bold, second-best are underlined.
Top configuration: Claude-Opus-4.7 with Mini-SWE-Agent (68.3)
Agent harnesses: 3 (Claude Code, Hermes Agent, Mini-SWE-Agent)
Evaluated configs: 16
Classic 0/1 reward
| # | Harness | Model | | | | | Overall |
|---|---|---|---|---|---|---|---|
| 01 | Mini-SWE-Agent | Claude-Opus-4.7 (Anthropic) | 64.7 | 76.9 | 76.2 | 57.1 | 68.3 |
| 02 | Hermes Agent | Claude-Opus-4.7 (Anthropic) | 52.9 | 76.9 | 71.4 | 57.1 | 64.6 |
| 03 | Claude Code | Claude-Opus-4.7 (Anthropic) | 55.9 | 76.9 | 66.7 | 57.1 | 62.2 |
| 04 | Hermes Agent | Qwen-3.6-Plus (Alibaba) | 50.0 | 69.2 | 61.9 | 64.3 | 61.4 |
| 05 | Mini-SWE-Agent | DeepSeek-V4-Flash (DeepSeek) | 50.0 | 76.9 | 75.0 | 50.0 | 60.5 |
| 06 | Mini-SWE-Agent | Qwen-3.6-Plus (Alibaba) | 44.1 | 76.9 | 71.4 | 64.3 | 59.8 |
| 07 | Hermes Agent | MiniMax-M2.7 (MiniMax) | 44.1 | 69.2 | 61.9 | 57.1 | 58.1 |
| 08 | Hermes Agent | DeepSeek-V4-Flash (DeepSeek) | 55.9 | 46.2 | 76.2 | 50.0 | 57.1 |
| 09 | Mini-SWE-Agent | MiniMax-M2.7 (MiniMax) | 55.9 | 69.2 | 60.0 | 42.9 | 56.8 |
| 10 | Mini-SWE-Agent | Kimi-K2.6 (Moonshot AI) | 59.4 | 61.5 | 52.6 | 50.0 | 56.4 |
| 11 | Claude Code | Qwen-3.6-Plus (Alibaba) | 50.0 | 69.2 | 63.2 | 50.0 | 56.3 |
| 12 | Claude Code | MiniMax-M2.7 (MiniMax) | 47.1 | 69.2 | 66.7 | 50.0 | 56.1 |
| 13 | Hermes Agent | Claude-Sonnet-4.6 (Anthropic) | 47.1 | 53.8 | 66.7 | 50.0 | 54.4 |
| 14 | Claude Code | GPT-5.3 Codex (OpenAI) | 47.1 | 53.8 | 65.0 | 50.0 | 53.1 |
| 15 | Claude Code | Claude-Sonnet-4.6 (Anthropic) | 50.0 | 61.5 | 61.9 | 35.7 | 52.4 |
| 16 | Claude Code | Kimi-K2.6 (Moonshot AI) | 45.5 | 61.5 | 65.0 | 35.7 | 51.3 |
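The bold and underline highlighting mentioned in the caption does not appear in a plain-text copy of the table, so the following minimal sketch recomputes it from the scores. It assumes the last score column is the overall score, treats tied values as sharing a rank, and takes "second-best" to mean the second-highest distinct value. These conventions are assumptions, not something the leaderboard states, and the names `ROWS`, `top_two`, and `column_highlights` are hypothetical.

```python
# Scores copied from the leaderboard rows: four task-category columns
# (names not given in the table header), then the overall score.
ROWS = [
    ("Mini-SWE-Agent", "Claude-Opus-4.7", [64.7, 76.9, 76.2, 57.1, 68.3]),
    ("Hermes Agent", "Claude-Opus-4.7", [52.9, 76.9, 71.4, 57.1, 64.6]),
    ("Claude Code", "Claude-Opus-4.7", [55.9, 76.9, 66.7, 57.1, 62.2]),
    ("Hermes Agent", "Qwen-3.6-Plus", [50.0, 69.2, 61.9, 64.3, 61.4]),
    ("Mini-SWE-Agent", "DeepSeek-V4-Flash", [50.0, 76.9, 75.0, 50.0, 60.5]),
    ("Mini-SWE-Agent", "Qwen-3.6-Plus", [44.1, 76.9, 71.4, 64.3, 59.8]),
    ("Hermes Agent", "MiniMax-M2.7", [44.1, 69.2, 61.9, 57.1, 58.1]),
    ("Hermes Agent", "DeepSeek-V4-Flash", [55.9, 46.2, 76.2, 50.0, 57.1]),
    ("Mini-SWE-Agent", "MiniMax-M2.7", [55.9, 69.2, 60.0, 42.9, 56.8]),
    ("Mini-SWE-Agent", "Kimi-K2.6", [59.4, 61.5, 52.6, 50.0, 56.4]),
    ("Claude Code", "Qwen-3.6-Plus", [50.0, 69.2, 63.2, 50.0, 56.3]),
    ("Claude Code", "MiniMax-M2.7", [47.1, 69.2, 66.7, 50.0, 56.1]),
    ("Hermes Agent", "Claude-Sonnet-4.6", [47.1, 53.8, 66.7, 50.0, 54.4]),
    ("Claude Code", "GPT-5.3 Codex", [47.1, 53.8, 65.0, 50.0, 53.1]),
    ("Claude Code", "Claude-Sonnet-4.6", [50.0, 61.5, 61.9, 35.7, 52.4]),
    ("Claude Code", "Kimi-K2.6", [45.5, 61.5, 65.0, 35.7, 51.3]),
]


def top_two(values):
    """Return (best, second_best) as the two highest distinct values."""
    distinct = sorted(set(values), reverse=True)
    second = distinct[1] if len(distinct) > 1 else None
    return distinct[0], second


def column_highlights(rows):
    """Compute (best, second_best) for every score column."""
    n_cols = len(rows[0][2])
    return [top_two([scores[i] for _, _, scores in rows]) for i in range(n_cols)]


HIGHLIGHTS = column_highlights(ROWS)

for index, (best, second) in enumerate(HIGHLIGHTS, start=1):
    print(f"score column {index}: best={best}, second-best={second}")
```

Under the distinct-value convention, every configuration tied at a column's top score would be bold. A stricter reading that bolds only one row per column would need a tie-breaking rule the page does not specify.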