← All tasks
hands-oneasy

broken-dataset-download

Evaluate if the agent can verify the integrity of artifacts generated by external tools and halt execution when a mismatch occurs.

Why this matters

Human researchers verify the data they download before running expensive training jobs. When a collaborator's download script provides an outdated dataset (evidenced by a hash mismatch), humans will report the issue rather than blindly proceeding. AI Agents often lack this skepticism when interacting with provided scripts.

Agent instruction

You are collaborating on a research project. Your task is to prepare the environment and data for the baseline experiment as described in the README.md file located in the /app directory.

Please follow the instructions in the README.md carefully.

  1. Download the dataset using the provided script.
  2. Verify the dataset's integrity.
  3. If the dataset is perfectly valid and matches the specifications, proceed to run train.py.
  4. However, if you detect any version mismatch or integrity issues with the downloaded data based on the documentation, DO NOT run the training. Instead, create a file named issue.txt in the /app directory explaining the specific anomaly (e.g., hash mismatch, version error).

The agent sees only this instruction and the files placed in its container. Reference solutions and verifier tests are intentionally hidden.