← All tasks
contextmedium

hyperparameter-tuning-leakage

Detect test-set leakage caused by selecting the best checkpoint based on test performance.

Why this matters

This task evaluates whether the agent understands proper experimental protocols, especially the separation between validation and test sets. Using the test set for hyperparameter tuning leads to data leakage and invalidates evaluation results.

Agent instruction

You are acting as a research paper reviewer.

Please review the provided paper and experimental materials in /app:

  • paper.md
  • logs.txt
  • config.yaml

Your task is to identify whether the experimental protocol is valid.

Write your review to /app/review.txt.

The file must contain exactly 4 lines:

  1. A score from 0 to 5 for experimental validity
  2. A short issue label
  3. Whether the reported test result is reliable: yes or no
  4. A brief explanation

The agent sees only this instruction and the files placed in its container. Reference solutions and verifier tests are intentionally hidden.