Act As a Real Researcher
Evaluating how well LLM agents close the gap with human researchers across the full research lifecycle. The benchmark goes beyond code execution to test the cognitive gaps that still separate frontier agents from human researchers.
The AARR Series
Three stages of research autonomy
Act As a Real Researcher progresses through three stages of increasing autonomy and difficulty, from diligent intern to independent scientist.
AARRI
Act As a Real Research Intern
Entry-level research tasks done with diligence and sound methodology.
AARRA
Act As a Real Research Assistant
More independent contributions, critical evaluation, MCP and agent skills, LLM-as-judge, crowdsourced data.
AARRS
Act As a Real Research Scientist
Fully independent research and scientific discovery with minimal supervision.
Leaderboard
Top performers on AARRI
Metric: classic 0/1 reward (each task either passes or fails), shown as percentages.
| # | Harness | Model | | | | | |
|---|---|---|---|---|---|---|---|
| 01 | Mini-SWE-Agent | Claude-Opus-4.7 | 64.7 | 76.9 | 76.2 | 57.1 | 68.3 |
| 02 | Hermes Agent | Claude-Opus-4.7 | 52.9 | 76.9 | 71.4 | 57.1 | 64.6 |
| 03 | Claude Code | Claude-Opus-4.7 | 55.9 | 76.9 | 66.7 | 57.1 | 62.2 |
| 04 | Hermes Agent | Qwen-3.6-Plus | 50.0 | 69.2 | 61.9 | 64.3 | 61.4 |
| 05 | Mini-SWE-Agent | DeepSeek-V4-Flash | 50.0 | 76.9 | 75.0 | 50.0 | 60.5 |
Evaluate your agent on AARRI
Tasks are containerized via the Harbor framework. Pull the dataset and run every task against your model and agent harness.
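Once every task has been run, the binary rewards still have to be turned into leaderboard-style percentages. The following is a minimal sketch of that step, assuming each task yields a single 0/1 reward and that a score is the plain pass rate. The function name `summarize`, the category labels, and the sample data are hypothetical; the page does not describe how Harbor reports results or how the leaderboard columns are aggregated.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate binary task rewards into percentage scores.

    `results` is an iterable of (category, reward) pairs, reward being 0 or 1.
    Returns (per_category, overall), each score a percentage rounded to 0.1.
    Assumption: a score is the plain pass rate over the tasks that were run.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, reward in results:
        if reward not in (0, 1):
            raise ValueError(f"expected a 0/1 reward, got {reward!r}")
        passed[category] += reward
        total[category] += 1

    per_category = {c: round(100 * passed[c] / total[c], 1) for c in total}
    n_tasks = sum(total.values())
    overall = round(100 * sum(passed.values()) / n_tasks, 1) if n_tasks else 0.0
    return per_category, overall


# Hypothetical rewards for four tasks in two made-up categories.
example = [("cat_a", 1), ("cat_a", 0), ("cat_a", 1), ("cat_b", 1)]
per_category, overall = summarize(example)
print(per_category, overall)
```

Note that in this sketch the overall score is weighted by the number of tasks in each category, so it generally differs from the unweighted mean of the per-category scores.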