The environment is the test.
The agent does the work. The deployed system grades it. Four steps, one boundary.
01
Work
An agent takes on a real infrastructure task: provision a system to spec, migrate a runtime, repair a broken deployment. It writes infrastructure as code (CDK or Terraform) with the same tools an engineer uses, inside its own isolated cloud account.
02
Deploy
The work is actually deployed. There is no simulation layer to charm and no rubric to argue with. The environment either comes up healthy or it does not.
03
Verify
A verifier in a separate trust domain inspects what was deployed, through read-only APIs. It never executes the agent's code or tooling in the grading path, so there is no agent-authored code between the work and the grade.
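A minimal sketch of what such a verifier could look like, assuming the read-only APIs return the deployed state as a plain dictionary. The names (`observed_state`, `CHECKS`, `verify`) and the three checks are hypothetical illustrations, not Assay's actual implementation. The point is that the grading path consumes only observed data and never imports or runs anything the agent wrote.

```python
from typing import Any, Callable, Dict

# Hypothetical snapshot, standing in for the result of read-only "describe"
# calls made with the verifier's own credentials.
observed_state: Dict[str, Any] = {
    "service_status": "healthy",
    "open_ports": [443],
    "storage_encrypted": True,
}

# Each check is a pure function of the observed state:
# no I/O, no clock, no randomness, no agent-authored code.
CHECKS: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "service_is_healthy": lambda s: s.get("service_status") == "healthy",
    "only_https_exposed": lambda s: s.get("open_ports") == [443],
    "storage_is_encrypted": lambda s: s.get("storage_encrypted") is True,
}


def verify(state: Dict[str, Any]) -> Dict[str, bool]:
    """Evaluate every check against the snapshot in a fixed (sorted) order."""
    return {name: CHECKS[name](state) for name in sorted(CHECKS)}


results = verify(observed_state)
```

Because every check is a pure function of the snapshot, the same deployed state always yields the same results, which is what makes the later grade reproducible.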
04
Record
The grade is computed deterministically, signed with a key the agent never holds, and written once to an immutable ledger. Anyone with the public key can verify any grade, offline, without trusting us.
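The record step can be sketched under some loud assumptions. The grade formula (checks passed over total), the `Ledger` class, and the field names below are all hypothetical, and the signature itself is omitted because the Python standard library has no asymmetric signing. In a real system an asymmetric signature (for example Ed25519) over `entry_hash` would be what lets a holder of the public key verify a grade offline. A hash chain on its own only makes tampering evident; it does not make storage immutable.

```python
import copy
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal records give equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def compute_grade(check_results: dict) -> dict:
    """Deterministic grade (assumed formula): count of passed checks over total."""
    passed = sum(1 for ok in check_results.values() if ok)
    return {"checks": check_results, "passed": passed, "total": len(check_results)}


def _entry_hash(prev_hash: str, record: dict) -> str:
    """Hash that commits to both the previous entry and this record."""
    return hashlib.sha256(prev_hash.encode("ascii") + canonical_bytes(record)).hexdigest()


class Ledger:
    """Append-only, hash-chained list (in-memory stand-in for a write-once store)."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries = []

    def append(self, record: dict) -> str:
        prev = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        digest = _entry_hash(prev, record)
        self._entries.append(
            {"prev_hash": prev, "record": copy.deepcopy(record), "entry_hash": digest}
        )
        return digest  # this digest is what a real system would sign

    def entries(self) -> list:
        return copy.deepcopy(self._entries)  # callers cannot mutate the ledger


def chain_is_intact(entries: list) -> bool:
    """Recompute every hash from the records; any edit breaks the chain."""
    prev = Ledger.GENESIS
    for entry in entries:
        if entry["prev_hash"] != prev or entry["entry_hash"] != _entry_hash(prev, entry["record"]):
            return False
        prev = entry["entry_hash"]
    return True


ledger = Ledger()
ledger.append(compute_grade({"service_is_healthy": True, "only_https_exposed": False}))
ledger.append(compute_grade({"service_is_healthy": True, "only_https_exposed": True}))
```

Canonical serialization matters here: without sorted keys and fixed separators, two logically identical grades could hash differently and an honest record would fail verification.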
The boundary, drawn.
Grades live in a trust domain the agent cannot reach: separate account, separate credentials, no network path. Observation crosses the wall in one direction only.
Trust domain A (agent)
- Live infrastructure: the deployed system, observable state
- Agent: writes infrastructure as code
Trust domain B (verifier)
- Verifier: grades deployed state deterministically
- Signed record: immutable, independently verifiable
Prevention, not detection.
Most defenses against reward hacking are detectors: they audit trajectories and flag manipulation after the fact. But training against a detector teaches evasion. Assay's answer is architectural. The grader is unreachable by construction, so there is no surface to learn to evade.
Detection flags reward hacking. Isolation makes it unreachable.
Five families of real work.
Every family is long-horizon, multi-file, and deterministically gradable. That combination is exactly what the market is short on.
- Provisioning: Stand up a system to spec, graded by a clean deploy, integration checks, and policy.
- Drift and debug: Reconcile a drifted or broken deployment until the plan is clean and the system is healthy.
- Migration and upgrades: Move runtimes, providers, or architectures with behavioral equivalence as the bar.
- Incident response: Diagnose and remediate a failing deployment until service health is restored.
- Cost and security: Reduce spend or close misconfigurations while preserving function.