Publication

The AARRI-Bench paper

AARRI-Bench introduces the first stage of the AARR series, evaluating whether LLM agents can act as real research interns.


BibTeX

arXiv:2606.07462

@article{wang2026act,
  title={Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle},
  author={Wang, Jiayu and Lv, Weijiang and Fu, Bowen and Fu, Jing and Song, Jiayi and Zhang, Lingyu and Xue, Lanxuan and Chen, Luodi and Xin, Zepeng and Li, Kaiyu and others},
  journal={arXiv preprint arXiv:2606.07462},
  year={2026}
}

Abstract

AARRI-Bench is the first release in the AARR series. Rather than testing whether an agent can simply execute code, its tasks target the cognitive gaps that still separate frontier agents from human researchers: context sensitivity, independent judgment, knowing when to quit, and collaboration. Tasks are containerized via the Harbor framework and run in isolated Docker environments, with verifiers that score each agent run against a hidden reference.