Stage 1 · Live

AARRI

Act As a Real Research Intern

AARRI-Bench is the first release in the AARR series. Rather than testing whether an agent can simply execute code, its tasks target the cognitive gaps that still separate frontier agents from human researchers — context sensitivity, independent judgment, knowing when to quit, and collaboration.

Tasks: 82
Categories: 4
Difficulty levels: 3

Researcher cognitive gaps

AARRI targets the judgment and awareness that distinguish a careful research intern from a code-executing agent.

Context sensitivity

Independent judgment

Knowing when to quit

Collaboration

Tasks by category

context: 34
interaction: 21
hands-on: 14
mindset: 13

Tasks by difficulty

easy: 9
medium: 48
hard: 25
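
As a quick consistency check on the two breakdowns above, the short Python sketch below confirms that each one sums to the 82 tasks stated on this page and converts the counts into percentage shares. The counts are copied from this page; the names `shares`, `by_category`, and `by_difficulty` are illustrative and not part of any AARRI tooling.

```python
# Task counts as listed on this page.
TOTAL_TASKS = 82
by_category = {"context": 34, "interaction": 21, "hands-on": 14, "mindset": 13}
by_difficulty = {"easy": 9, "medium": 48, "hard": 25}


def shares(counts):
    """Return each bucket's share of the total as a percentage, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Each breakdown should cover every task exactly once.
assert sum(by_category.values()) == TOTAL_TASKS
assert sum(by_difficulty.values()) == TOTAL_TASKS

category_shares = shares(by_category)
difficulty_shares = shares(by_difficulty)
```

On these counts, context tasks make up roughly two fifths of the benchmark and medium-difficulty tasks a little under three fifths.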

Leaderboard

Agent Overall Performance by Task Category. Best results are bold, second-best are underlined.

#   Harness         Model              Provider     Context  Mindset  Interaction  Hands-on  Overall
01  Mini-SWE-Agent  Claude-Opus-4.7    Anthropic    64.7     76.9     76.2         57.1      68.3
02  Hermes Agent    Claude-Opus-4.7    Anthropic    52.9     76.9     71.4         57.1      64.6
03  Claude Code     Claude-Opus-4.7    Anthropic    55.9     76.9     66.7         57.1      62.2
04  Hermes Agent    Qwen-3.6-Plus      Alibaba      50.0     69.2     61.9         64.3      61.4
05  Mini-SWE-Agent  DeepSeek-V4-Flash  DeepSeek     50.0     76.9     75.0         50.0      60.5
06  Mini-SWE-Agent  Qwen-3.6-Plus      Alibaba      44.1     76.9     71.4         64.3      59.8
07  Hermes Agent    MiniMax-M2.7       MiniMax      44.1     69.2     61.9         57.1      58.1
08  Hermes Agent    DeepSeek-V4-Flash  DeepSeek     55.9     46.2     76.2         50.0      57.1
09  Mini-SWE-Agent  MiniMax-M2.7       MiniMax      55.9     69.2     60.0         42.9      56.8
10  Mini-SWE-Agent  Kimi-K2.6          Moonshot AI  59.4     61.5     52.6         50.0      56.4
11  Claude Code     Qwen-3.6-Plus      Alibaba      50.0     69.2     63.2         50.0      56.3
12  Claude Code     MiniMax-M2.7       MiniMax      47.1     69.2     66.7         50.0      56.1
13  Hermes Agent    Claude-Sonnet-4.6  Anthropic    47.1     53.8     66.7         50.0      54.4
14  Claude Code     GPT-5.3 Codex      OpenAI       47.1     53.8     65.0         50.0      53.1
15  Claude Code     Claude-Sonnet-4.6  Anthropic    50.0     61.5     61.9         35.7      52.4
16  Claude Code     Kimi-K2.6          Moonshot AI  45.5     61.5     65.0         35.7      51.3
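
One way to read the leaderboard is to hold the model fixed and compare harnesses. The Python sketch below does this with the final (overall) score of each row, copied from the leaderboard on this page. The function names `best_per_model` and `harness_spread` are hypothetical helpers written for illustration; they are not part of the benchmark's tooling.

```python
# (harness, model, overall score) copied from the leaderboard on this page.
ROWS = [
    ("Mini-SWE-Agent", "Claude-Opus-4.7", 68.3),
    ("Hermes Agent", "Claude-Opus-4.7", 64.6),
    ("Claude Code", "Claude-Opus-4.7", 62.2),
    ("Hermes Agent", "Qwen-3.6-Plus", 61.4),
    ("Mini-SWE-Agent", "DeepSeek-V4-Flash", 60.5),
    ("Mini-SWE-Agent", "Qwen-3.6-Plus", 59.8),
    ("Hermes Agent", "MiniMax-M2.7", 58.1),
    ("Hermes Agent", "DeepSeek-V4-Flash", 57.1),
    ("Mini-SWE-Agent", "MiniMax-M2.7", 56.8),
    ("Mini-SWE-Agent", "Kimi-K2.6", 56.4),
    ("Claude Code", "Qwen-3.6-Plus", 56.3),
    ("Claude Code", "MiniMax-M2.7", 56.1),
    ("Hermes Agent", "Claude-Sonnet-4.6", 54.4),
    ("Claude Code", "GPT-5.3 Codex", 53.1),
    ("Claude Code", "Claude-Sonnet-4.6", 52.4),
    ("Claude Code", "Kimi-K2.6", 51.3),
]


def best_per_model(rows):
    """Map each model to its highest-scoring (harness, overall) pair."""
    best = {}
    for harness, model, overall in rows:
        if model not in best or overall > best[model][1]:
            best[model] = (harness, overall)
    return best


def harness_spread(rows):
    """Map each model to (max - min) overall score across its listed harnesses."""
    scores = {}
    for _, model, overall in rows:
        scores.setdefault(model, []).append(overall)
    return {m: round(max(v) - min(v), 1) for m, v in scores.items()}


best = best_per_model(ROWS)
spread = harness_spread(ROWS)
```

Not every model appears under every harness (GPT-5.3 Codex, for example, is listed only with Claude Code), so the spread is a rough indication of how much the harness matters for a given model, not a controlled comparison.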