Publication
The AARRI-Bench paper
AARRI-Bench introduces the first stage of the AARR series, evaluating whether LLM agents can act as real research interns.
BibTeX
arXiv:2606.07462
@article{aarri2026,
title = {AARRI-Bench: Act As a Real Research Intern},
author = {AARR-bench Team},
journal = {arXiv preprint arXiv:2606.07462},
year = {2026},
url = {https://arxiv.org/abs/2606.07462}
}
Abstract
AARRI-Bench is the first release in the AARR series. Rather than testing whether an agent can simply execute code, its tasks target the cognitive gaps that still separate frontier agents from human researchers: context sensitivity, independent judgment, knowing when to quit, and collaboration. Tasks are containerized via the Harbor framework and run in isolated Docker environments, with verifiers that score each agent run against a hidden reference.