← All tasks
contextmedium
hyperparameter-tuning-leakage
Detect test-set leakage caused by selecting the best checkpoint based on test performance.
Why this matters
This task evaluates whether the agent understands proper experimental protocols, especially the separation between validation and test sets. Using the test set for hyperparameter tuning leads to data leakage and invalidates evaluation results.
Agent instruction
You are acting as a research paper reviewer.
Please review the provided paper and experimental materials in /app:
paper.mdlogs.txtconfig.yaml
Your task is to identify whether the experimental protocol is valid.
Write your review to /app/review.txt.
The file must contain exactly 4 lines:
- A score from 0 to 5 for experimental validity
- A short issue label
- Whether the reported test result is reliable: yes or no
- A brief explanation
The agent sees only this instruction and the files placed in its container. Reference solutions and verifier tests are intentionally hidden.