One question, answered carefully.
Does training on these environments improve a model on long-horizon infrastructure work it has never seen? Everything in the design exists to make that answer trustworthy.
The register.
- 01 Pre-registered protocol
- The analysis is specified before the run. No garden of forking paths.
- 02 Held-out evaluation
- Models are evaluated on tasks they never trained on, including out-of-distribution splits.
- 03 Real baselines
- Comparisons are made against untrained baselines under identical harnesses and budgets.
- 04 Multiple seeds
- Runs are repeated. Uncertainty is reported, not hidden.
- 05 Contamination control
- Evaluation tasks are kept out of anything a model could have seen, and we check.
- 06 Execution-graded reward
- The grade is binary and comes from the deployed system, so the metric cannot drift with a judge's mood.
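To make the multiple-seeds and binary-grade items concrete, here is a minimal sketch of one way such a comparison could be summarized. It is an illustration under stated assumptions, not the registered analysis: the function names, the percentile bootstrap, and every number in it are hypothetical placeholders, not results.

```python
import random
import statistics

def pass_rate(outcomes):
    """Fraction of tasks passed; each outcome is a binary grade (0 or 1)."""
    return sum(outcomes) / len(outcomes)

def bootstrap_ci(trained, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(trained) - mean(baseline).

    Resamples per-seed pass rates with replacement. This is an assumed
    choice of method for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        t = [rng.choice(trained) for _ in trained]
        b = [rng.choice(baseline) for _ in baseline]
        diffs.append(statistics.mean(t) - statistics.mean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder grades (NOT real results): one list per seed,
# one binary entry per held-out task.
trained_runs = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 1]]
baseline_runs = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]

trained_rates = [pass_rate(r) for r in trained_runs]
baseline_rates = [pass_rate(r) for r in baseline_runs]
point = statistics.mean(trained_rates) - statistics.mean(baseline_rates)
low, high = bootstrap_ci(trained_rates, baseline_rates)
print(f"difference in pass rate: {point:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Reporting the interval alongside the point estimate is what keeps a single lucky seed from passing as an improvement.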
What we publish.
The design of the verifier and the eval is public, because publishing a design leaks nothing. Results are shared privately with partners. We keep no public rankings and make no public performance claims.
When numbers matter, our partners do not have to take our word for them: every graded trajectory carries a signature they can verify against the record themselves.
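As a rough illustration of what checking a trajectory against the record can look like, the sketch below signs and verifies a graded record. The actual signature scheme is not specified here, so everything in it is an assumption: the record fields are hypothetical, and HMAC-SHA256 with a shared key stands in only because it ships with the Python standard library. A scheme in which partners verify without holding a secret would use an asymmetric signature instead.

```python
import hashlib
import hmac
import json

def canonical_bytes(record):
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign(record, key):
    """Return a hex HMAC-SHA256 tag over the canonical record (stand-in scheme)."""
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()

def verify(record, signature, key):
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(record, key), signature)

# Hypothetical key and record fields, for illustration only.
key = b"demo-key-not-a-real-secret"
record = {"trajectory_id": "traj-0001", "task_id": "task-0042", "passed": True}
signature = sign(record, key)

print(verify(record, signature, key))    # True: record matches its signature

tampered = dict(record, passed=False)    # flip the grade after signing
print(verify(tampered, signature, key))  # False: any edit invalidates the tag
```

The point of the design is the second check: a grade that has been altered after signing no longer verifies, so the number a partner sees is the number that was recorded.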