AARRI
Act As a Real Research Intern
AARRI-Bench is the first release in the AARR series. Rather than testing whether an agent can simply execute code, its tasks target the cognitive gaps that still separate frontier agents from human researchers — context sensitivity, independent judgment, knowing when to quit, and collaboration.
What it measures
Researcher cognitive gaps
AARRI targets the judgment and awareness that distinguish a careful research intern from a code-executing agent.
Context sensitivity
Independent judgment
Knowing when to quit
Collaboration
Tasks by category
Tasks by difficulty
Sample tasks
A look inside the benchmark
ablation-completeness-audit
Audit 150 experiment logs to find which ablation configurations are missing and whether their absence is cherry-picked.
[context] ablation-summary
[interaction] advice-recall
Recognize that advice given 14 days ago was wrong (ViT ImageNet backbone for histopathology), take responsibility, redirect a discouraged student from futile hyperparameter tuning, and proactively warn a second student with the same underlying problem.
[interaction] advice-recall-pro
Audit your own prior advice to three students, distinguishing advice that was always wrong (requiring apology) from advice that was correct when given but is now debated by newer literature (requiring nuanced correction without over-apologizing).
[interaction] ambiguous-research-request
Evaluate whether the agent can identify critical decisions hidden inside an ambiguous collaborative research request.
[hands-on] bad-result-diagnosis
Evaluate whether the agent can diagnose a bad experiment across data, config, model wiring, and implementation layers, then make minimal repairs.
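To make the ablation-completeness-audit task above more concrete, here is a minimal Python sketch of the first step such an audit might take: enumerating the full ablation grid and listing the configurations that never appear in the logs. The factor names, values, and log entries are hypothetical and are not taken from the benchmark; the real task's log format and scoring are not described here.

```python
from collections import Counter
from itertools import product


def missing_ablation_configs(factors, logged_configs):
    """Return every grid configuration that never appears in the logs."""
    names = sorted(factors)
    # Full Cartesian grid of factor values, in a fixed factor order.
    full_grid = set(product(*(factors[name] for name in names)))
    # Configurations actually observed in the experiment logs.
    seen = {tuple(cfg[name] for name in names) for cfg in logged_configs}
    return [dict(zip(names, combo)) for combo in sorted(full_grid - seen)]


def missing_counts_by_value(missing):
    """Count how often each (factor, value) pair occurs among missing configs."""
    return Counter((name, value) for cfg in missing for name, value in cfg.items())


# Hypothetical ablation factors (assumed for illustration only).
factors = {
    "augmentation": ["off", "on"],
    "pretraining": ["none", "imagenet"],
    "loss": ["ce", "focal"],
}

# Hypothetical parsed logs: six of the eight grid cells were run.
logged = [
    {"augmentation": "off", "pretraining": "none", "loss": "ce"},
    {"augmentation": "off", "pretraining": "imagenet", "loss": "ce"},
    {"augmentation": "off", "pretraining": "imagenet", "loss": "focal"},
    {"augmentation": "on", "pretraining": "none", "loss": "ce"},
    {"augmentation": "on", "pretraining": "imagenet", "loss": "ce"},
    {"augmentation": "on", "pretraining": "imagenet", "loss": "focal"},
]

missing = missing_ablation_configs(factors, logged)
counts = missing_counts_by_value(missing)

for cfg in missing:
    print("missing:", cfg)
for (name, value), n in sorted(counts.items()):
    print(f"{name}={value}: absent in {n} missing config(s)")
```

In this toy data every missing run combines focal loss with no pretraining, so the gaps cluster on one corner of the grid. A concentration like that is only a signal worth investigating; deciding whether the omission is cherry-picked still takes the judgment the task is meant to test.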
Results
Leaderboard
Agent Overall Performance by Task Category. Best results are bold, second-best are underlined.
| # | Harness | Model | | | | | Overall |
|---|---|---|---|---|---|---|---|
| 01 | Mini-SWE-Agent | Claude-Opus-4.7 (Anthropic) | 64.7 | 76.9 | 76.2 | 57.1 | 68.3 |
| 02 | Hermes Agent | Claude-Opus-4.7 (Anthropic) | 52.9 | 76.9 | 71.4 | 57.1 | 64.6 |
| 03 | Claude Code | Claude-Opus-4.7 (Anthropic) | 55.9 | 76.9 | 66.7 | 57.1 | 62.2 |
| 04 | Hermes Agent | Qwen-3.6-Plus (Alibaba) | 50.0 | 69.2 | 61.9 | 64.3 | 61.4 |
| 05 | Mini-SWE-Agent | DeepSeek-V4-Flash (DeepSeek) | 50.0 | 76.9 | 75.0 | 50.0 | 60.5 |
| 06 | Mini-SWE-Agent | Qwen-3.6-Plus (Alibaba) | 44.1 | 76.9 | 71.4 | 64.3 | 59.8 |
| 07 | Hermes Agent | MiniMax-M2.7 (MiniMax) | 44.1 | 69.2 | 61.9 | 57.1 | 58.1 |
| 08 | Hermes Agent | DeepSeek-V4-Flash (DeepSeek) | 55.9 | 46.2 | 76.2 | 50.0 | 57.1 |
| 09 | Mini-SWE-Agent | MiniMax-M2.7 (MiniMax) | 55.9 | 69.2 | 60.0 | 42.9 | 56.8 |
| 10 | Mini-SWE-Agent | Kimi-K2.6 (Moonshot AI) | 59.4 | 61.5 | 52.6 | 50.0 | 56.4 |
| 11 | Claude Code | Qwen-3.6-Plus (Alibaba) | 50.0 | 69.2 | 63.2 | 50.0 | 56.3 |
| 12 | Claude Code | MiniMax-M2.7 (MiniMax) | 47.1 | 69.2 | 66.7 | 50.0 | 56.1 |
| 13 | Hermes Agent | Claude-Sonnet-4.6 (Anthropic) | 47.1 | 53.8 | 66.7 | 50.0 | 54.4 |
| 14 | Claude Code | GPT-5.3 Codex (OpenAI) | 47.1 | 53.8 | 65.0 | 50.0 | 53.1 |
| 15 | Claude Code | Claude-Sonnet-4.6 (Anthropic) | 50.0 | 61.5 | 61.9 | 35.7 | 52.4 |
| 16 | Claude Code | Kimi-K2.6 (Moonshot AI) | 45.5 | 61.5 | 65.0 | 35.7 | 51.3 |