Leaderboard

Agent Overall Performance by Task Category. Best results are bold, second-best are underlined.

Top configuration: Claude-Opus-4.7 with Mini-SWE-Agent (68.3)

Agent harnesses: 3 (Claude Code, Hermes Agent, Mini-SWE-Agent)

Evaluated configs: 16

Classic 0/1 reward

#   Harness         Model              Provider     Task-category scores    Overall
01  Mini-SWE-Agent  Claude-Opus-4.7    Anthropic    64.7  76.9  76.2  57.1  68.3
02  Hermes Agent    Claude-Opus-4.7    Anthropic    52.9  76.9  71.4  57.1  64.6
03  Claude Code     Claude-Opus-4.7    Anthropic    55.9  76.9  66.7  57.1  62.2
04  Hermes Agent    Qwen-3.6-Plus      Alibaba      50.0  69.2  61.9  64.3  61.4
05  Mini-SWE-Agent  DeepSeek-V4-Flash  DeepSeek     50.0  76.9  75.0  50.0  60.5
06  Mini-SWE-Agent  Qwen-3.6-Plus      Alibaba      44.1  76.9  71.4  64.3  59.8
07  Hermes Agent    MiniMax-M2.7       MiniMax      44.1  69.2  61.9  57.1  58.1
08  Hermes Agent    DeepSeek-V4-Flash  DeepSeek     55.9  46.2  76.2  50.0  57.1
09  Mini-SWE-Agent  MiniMax-M2.7       MiniMax      55.9  69.2  60.0  42.9  56.8
10  Mini-SWE-Agent  Kimi-K2.6          Moonshot AI  59.4  61.5  52.6  50.0  56.4
11  Claude Code     Qwen-3.6-Plus      Alibaba      50.0  69.2  63.2  50.0  56.3
12  Claude Code     MiniMax-M2.7       MiniMax      47.1  69.2  66.7  50.0  56.1
13  Hermes Agent    Claude-Sonnet-4.6  Anthropic    47.1  53.8  66.7  50.0  54.4
14  Claude Code     GPT-5.3 Codex      OpenAI       47.1  53.8  65.0  50.0  53.1
15  Claude Code     Claude-Sonnet-4.6  Anthropic    50.0  61.5  61.9  35.7  52.4
16  Claude Code     Kimi-K2.6          Moonshot AI  45.5  61.5  65.0  35.7  51.3
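
The page does not say how the scores are derived from the 0/1 reward. As a minimal sketch, assuming each task yields a reward of 0 (failed) or 1 (solved) and a score is the mean reward expressed as a percentage, the arithmetic could look like the following. The function name `pass_rate` and the task counts are hypothetical and chosen only for illustration; the actual number of tasks per category is not given here.

```python
from typing import Sequence


def pass_rate(rewards: Sequence[int]) -> float:
    """Return the percentage of tasks with reward 1, rounded to one decimal.

    Assumption: every reward is exactly 0 or 1 (a binary pass/fail signal).
    """
    if not rewards:
        raise ValueError("no rewards to aggregate")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("rewards must be 0 or 1")
    return round(100 * sum(rewards) / len(rewards), 1)


# Hypothetical category with 13 tasks, 10 of them solved.
example_rewards = [1] * 10 + [0] * 3
print(pass_rate(example_rewards))  # 76.9
```

Under this assumption, scores within one category can only move in steps of 100 divided by the number of tasks, so small differences between configurations may correspond to a single task.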
