Stage 1 (Live)

AARRI

Act As a Real Research Intern

AARRI-Bench is the first release in the AARR series. Rather than testing whether an agent can simply execute code, its tasks target the cognitive gaps that still separate frontier agents from human researchers — context sensitivity, independent judgment, knowing when to quit, and collaboration.

Browse 82 tasks | Leaderboard | arXiv 2606.07462

Tasks

Researcher cognitive gaps

AARRI targets the judgment and awareness that distinguish a careful research intern from a code-executing agent.

Context sensitivity

Independent judgment

Knowing when to quit

Collaboration

Tasks by category

context: 34

interaction: 21

hands-on: 14

mindset: 13

Tasks by difficulty

easy: 9

medium: 48

hard: 25
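
As a quick consistency check, the two breakdowns above should each account for all 82 tasks. The sketch below only re-adds the counts shown on this page and converts them to shares; the variable and function names are illustrative and not part of any AARRI tooling.

```python
# Counts copied from the category and difficulty breakdowns on this page.
by_category = {"context": 34, "interaction": 21, "hands-on": 14, "mindset": 13}
by_difficulty = {"easy": 9, "medium": 48, "hard": 25}

total = sum(by_category.values())
# Both breakdowns should partition the same set of 82 tasks.
assert total == sum(by_difficulty.values()) == 82


def shares(counts):
    """Return each bucket's share of the total as a percentage, rounded to 0.1."""
    n = sum(counts.values())
    return {name: round(100 * c / n, 1) for name, c in counts.items()}


category_shares = shares(by_category)
difficulty_shares = shares(by_difficulty)
print(category_shares)
print(difficulty_shares)
```

Under these counts, context tasks make up roughly two fifths of the benchmark and medium-difficulty tasks a little under three fifths.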

Sample tasks

A look inside the benchmark

context

ablation-completeness-audit

Audit 150 experiment logs to find which ablation configurations are missing and whether their absence is cherry-picked.

context

ablation-summary

interaction

advice-recall

Recognize that advice given 14 days ago was wrong (ViT ImageNet backbone for histopathology), take responsibility, redirect a discouraged student from futile hyperparameter tuning, and proactively warn a second student with the same underlying problem.

interaction
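
To make the ablation-completeness-audit sample above more concrete, here is a minimal sketch of the underlying idea: enumerate the full grid of ablation configurations, subtract the ones that appear in the logs, and look at where the gaps fall. The factor names, levels, and observed runs are hypothetical, and this is not the benchmark's grader or reference solution.

```python
from itertools import product

# Hypothetical ablation factors; a real audit would take these from the experiment plan.
factors = {
    "augmentation": ["on", "off"],
    "pretraining": ["on", "off"],
    "optimizer": ["sgd", "adam"],
}

# Hypothetical configurations recovered from parsed logs (one tuple per run).
observed = {
    ("on", "on", "sgd"), ("on", "on", "adam"),
    ("on", "off", "sgd"), ("on", "off", "adam"),
    ("off", "on", "sgd"), ("off", "on", "adam"),
}

# The full grid is the Cartesian product of all factor levels.
expected = set(product(*factors.values()))
missing = sorted(expected - observed)


def missing_by_level(factor_index):
    """Count missing configurations per level of one factor."""
    counts = {}
    for config in missing:
        level = config[factor_index]
        counts[level] = counts.get(level, 0) + 1
    return counts


print(missing)
for i, name in enumerate(factors):
    print(name, missing_by_level(i))
```

In this toy grid every missing run has both augmentation and pretraining switched off, while the optimizer is evenly split. Gaps that cluster on particular levels like this are a signal worth investigating, although clustering alone does not establish that results were cherry-picked.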

Leaderboard

Agent Overall Performance by Task Category. Best results are bold, second-best are underlined.

#	Harness	Model (Provider)	Context	Mindset	Interaction	Hands-on	Overall
01	Mini-SWE-Agent	Claude-Opus-4.7 (Anthropic)	64.7	76.9	76.2	57.1	68.3
02	Hermes Agent	Claude-Opus-4.7 (Anthropic)	52.9	76.9	71.4	57.1	64.6
03	Claude Code	Claude-Opus-4.7 (Anthropic)	55.9	76.9	66.7	57.1	62.2
04	Hermes Agent	Qwen-3.6-Plus (Alibaba)	50.0	69.2	61.9	64.3	61.4
05	Mini-SWE-Agent	DeepSeek-V4-Flash (DeepSeek)	50.0	76.9	75.0	50.0	60.5
06	Mini-SWE-Agent	Qwen-3.6-Plus (Alibaba)	44.1	76.9	71.4	64.3	59.8
07	Hermes Agent	MiniMax-M2.7 (MiniMax)	44.1	69.2	61.9	57.1	58.1
08	Hermes Agent	DeepSeek-V4-Flash (DeepSeek)	55.9	46.2	76.2	50.0	57.1
09	Mini-SWE-Agent	MiniMax-M2.7 (MiniMax)	55.9	69.2	60.0	42.9	56.8
10	Mini-SWE-Agent	Kimi-K2.6 (Moonshot AI)	59.4	61.5	52.6	50.0	56.4
11	Claude Code	Qwen-3.6-Plus (Alibaba)	50.0	69.2	63.2	50.0	56.3
12	Claude Code	MiniMax-M2.7 (MiniMax)	47.1	69.2	66.7	50.0	56.1
13	Hermes Agent	Claude-Sonnet-4.6 (Anthropic)	47.1	53.8	66.7	50.0	54.4
14	Claude Code	GPT-5.3 Codex (OpenAI)	47.1	53.8	65.0	50.0	53.1
15	Claude Code	Claude-Sonnet-4.6 (Anthropic)	50.0	61.5	61.9	35.7	52.4
16	Claude Code	Kimi-K2.6 (Moonshot AI)	45.5	61.5	65.0	35.7	51.3
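
The page does not say how the final score in each row is derived. One plausible reading, offered only as an assumption, is that the five score columns are context, mindset, interaction, hands-on, and overall, in that order, and that overall is a task-count-weighted mean of the four category scores. The sketch below checks that reading against rows 01 and 03. It does not reproduce every row (row 02, for instance, comes out lower than reported), so treat it as a consistency check rather than the benchmark's scoring rule.

```python
# Category sizes from the "Tasks by category" breakdown on this page.
sizes = {"context": 34, "mindset": 13, "interaction": 21, "hands-on": 14}


def weighted_overall(scores):
    """Task-count-weighted mean of per-category scores (in percent)."""
    total = sum(sizes.values())
    return sum(scores[c] * sizes[c] for c in sizes) / total


# Per-category scores for rows 01 and 03, under the ASSUMED column order.
row_01 = {"context": 64.7, "mindset": 76.9, "interaction": 76.2, "hands-on": 57.1}
row_03 = {"context": 55.9, "mindset": 76.9, "interaction": 66.7, "hands-on": 57.1}

overall_01 = round(weighted_overall(row_01), 1)
overall_03 = round(weighted_overall(row_03), 1)
print(overall_01, overall_03)
```

For these two rows the weighted mean agrees with the reported 68.3 and 62.2. Rows that disagree may reflect unscored tasks or a different aggregation, which the page does not specify.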