AARRI-Bench is live · arXiv 2606.07462

Act As a Real Researcher

AARRI-Bench evaluates how well LLM agents close the gap with human researchers across the full research lifecycle. It goes beyond code execution to test the cognitive gaps that still separate frontier agents from human researchers.

82 AARRI tasks · 4 categories · 3 series stages

Three stages of research autonomy

Act As a Real Researcher progresses through increasing autonomy and difficulty — from diligent intern to independent scientist.

Top performers on AARRI

Metric: Classic 0/1 reward.

#   Harness         Model
01  Mini-SWE-Agent  Claude-Opus-4.7    64.7  76.9  76.2  57.1  68.3
02  Hermes Agent    Claude-Opus-4.7    52.9  76.9  71.4  57.1  64.6
03  Claude Code     Claude-Opus-4.7    55.9  76.9  66.7  57.1  62.2
04  Hermes Agent    Qwen-3.6-Plus      50.0  69.2  61.9  64.3  61.4
05  Mini-SWE-Agent  DeepSeek-V4-Flash  50.0  76.9  75.0  50.0  60.5
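
As a rough illustration of how a 0/1 reward turns into percentage scores like those on the leaderboard, here is a minimal sketch. It assumes each task yields a reward of 0 or 1 and that a score is the share of tasks solved. The page does not specify how scores are aggregated, so the function `pass_rates`, the category names, and the sample rewards are all hypothetical.

```python
from collections import defaultdict


def pass_rates(results):
    """Aggregate binary rewards into percentage pass rates.

    results: iterable of (category, reward) pairs, where reward is 0 or 1.
    Returns (per-category rates, overall rate), both as percentages.
    """
    per_category = defaultdict(list)
    for category, reward in results:
        if reward not in (0, 1):
            raise ValueError(f"reward must be 0 or 1, got {reward!r}")
        per_category[category].append(reward)
    if not per_category:
        raise ValueError("no results to aggregate")

    # Share of solved tasks within each category.
    rates = {c: 100.0 * sum(r) / len(r) for c, r in per_category.items()}

    # Overall rate computed over all tasks (assumed; not a mean of category rates).
    all_rewards = [r for rewards in per_category.values() for r in rewards]
    overall = 100.0 * sum(all_rewards) / len(all_rewards)
    return rates, overall


# Hypothetical rewards for five tasks in two made-up categories.
example = [("cat_a", 1), ("cat_a", 0), ("cat_a", 1), ("cat_b", 1), ("cat_b", 0)]
rates, overall = pass_rates(example)
print({c: round(v, 1) for c, v in rates.items()}, round(overall, 1))
```

Under this assumption the overall score weights categories by their task counts, so it can differ from a simple average of the category scores.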

Evaluate your agent on AARRI

Tasks are containerized via the Harbor framework. Pull the dataset and run every task against your model and agent harness.