What is AARR-bench?

AARR (Act As a Real Researcher) is a benchmark series for evaluating LLM agents across the research lifecycle. The core question: what exactly are the gaps between AI agents and real human researchers? The series progresses through three stages of increasing autonomy — intern (AARRI), assistant (AARRA), and scientist (AARRS).

Quickstart

AARRI, the intern-stage benchmark, packages its tasks as containers via the Harbor framework. You can run the full benchmark against any model and agent harness that Harbor supports.

Option 1 — Run from Harbor Hub (no clone)

Install the Harbor CLI and pull the dataset directly from the registry.

uv tool install harbor
harbor run -d aarr/aarri-bench -m "<model>" -a "<agent>"

Option 2 — Run a local copy

Use this mode if you want to inspect or modify tasks, or run a subset of them.

git clone https://github.com/AARR-bench/AARRI-bench.git
cd AARRI-bench
uv tool install harbor
harbor run -p ./tasks -m "<model>" -a "<agent>"

How tasks are scored

Each task is a self-contained folder containing an instruction, a Docker environment, a hidden reference solution, and a verifier. The agent sees only the instruction and the files placed in its container. The verifier runs pytest assertions against the agent's output and writes a binary reward: 1 for success, 0 for failure. The leaderboard reports this 0/1 reward aggregated by task category.
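
To make the scoring rule concrete, here is a minimal sketch of how a binary reward and a per-category aggregate could be computed. It is not the benchmark's actual verifier or leaderboard code. The function names, the category labels, and the choice of a per-category mean (a pass rate) as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict


def binary_reward(assertion_results):
    """Collapse a task's assertion outcomes into a 0/1 reward.

    Assumption: a task counts as solved only if every assertion passed.
    """
    return 1 if all(assertion_results) else 0


def aggregate_by_category(results):
    """Average 0/1 rewards per category.

    `results` is an iterable of (category, reward) pairs. Using the mean
    as the aggregate is an assumption, not the documented leaderboard rule.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [solved, attempted]
    for category, reward in results:
        if reward not in (0, 1):
            raise ValueError(f"reward must be 0 or 1, got {reward!r}")
        totals[category][0] += reward
        totals[category][1] += 1
    return {cat: solved / attempted for cat, (solved, attempted) in totals.items()}


# Hypothetical categories and assertion outcomes, for illustration only.
results = [
    ("literature-review", binary_reward([True, True])),
    ("literature-review", binary_reward([True, False])),
    ("data-analysis", binary_reward([True])),
    ("data-analysis", binary_reward([True, True, True])),
]
scores = aggregate_by_category(results)
for category, rate in sorted(scores.items()):
    print(f"{category}: {rate:.2f}")
```

Because each reward is all-or-nothing, a task that passes most of its assertions still scores 0, so the per-category number in this sketch is the fraction of tasks fully solved rather than a partial-credit score.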