Act As a Real Researcher

Evaluating how well LLM agents close the gap with human researchers across the full research lifecycle. The tasks go beyond executing code and probe the cognitive gaps that still separate frontier agents from human researchers.



Three stages of research autonomy

Act As a Real Researcher progresses through three stages of increasing autonomy and difficulty, from diligent intern to independent scientist.

Stage 1 (Live)

AARRI

Act As a Real Research Intern

Entry-level research tasks done with diligence and sound methodology.

82 tasks

Stage 2 (Soon)

AARRA

Act As a Real Research Assistant

More independent contributions, critical evaluation, MCP and agent skills, LLM-as-judge, crowdsourced data.

In planning

Stage 3 (Soon)

AARRS

Act As a Real Research Scientist

Fully independent research and scientific discovery with minimal supervision.

In planning

Leaderboard

Top performers on AARRI

Metric: classic 0/1 reward (a task scores 1 if it passes and 0 otherwise).
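
As a rough illustration of how a 0/1 reward can be turned into a percentage score, the sketch below averages binary task outcomes. The function name `success_rate` and the sample rewards are hypothetical, and the page does not state how the leaderboard numbers are actually aggregated, so treat this as one plausible reading rather than the benchmark's scoring code.

```python
def success_rate(rewards):
    """Return the percentage of tasks with reward 1, rounded to one decimal.

    Assumption: every task contributes equally and each reward is 0 or 1.
    """
    if not rewards:
        raise ValueError("no rewards to aggregate")
    if any(r not in (0, 1) for r in rewards):
        raise ValueError("rewards must be 0 or 1")
    return round(100 * sum(rewards) / len(rewards), 1)


# Hypothetical outcomes for five tasks (made-up data, not benchmark results).
example_rewards = [1, 0, 1, 1, 0]
print(success_rate(example_rewards))  # 60.0
```

Under this reading, partial progress on a task earns nothing: a run counts only if it fully passes the task's check.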


#	Harness	Model
01	Mini-SWE-Agent	Claude-Opus-4.7	64.7	76.9	76.2	57.1	68.3
02	Hermes Agent	Claude-Opus-4.7	52.9	76.9	71.4	57.1	64.6
03	Claude Code	Claude-Opus-4.7	55.9	76.9	66.7	57.1	62.2
04	Hermes Agent	Qwen-3.6-Plus	50.0	69.2	61.9	64.3	61.4
05	Mini-SWE-Agent	DeepSeek-V4-Flash	50.0	76.9	75.0	50.0	60.5

Evaluate your agent on AARRI

Tasks are containerized via the Harbor framework. Pull the dataset and run every task against your model and agent harness.
