The environment is the test.

The agent does the work. The deployed system grades it. Four steps, one boundary.

01
Work
An agent takes on a real infrastructure task: provision a system to spec, migrate a runtime, repair a broken deployment. It writes infrastructure as code, CDK or Terraform, with the same tools an engineer uses, inside an isolated cloud account of its own.
02
Deploy
The work is actually deployed. There is no simulation layer to charm and no rubric to argue with. The environment either comes up healthy or it does not.
03
Verify
A verifier in a separate trust domain inspects what was deployed, through read-only APIs. It never executes the agent's code or tooling in the grading path, so there is no agent-authored code between the work and the grade.
04
Record
The grade is computed deterministically, signed with a key the agent never holds, and written once to an immutable ledger. Anyone with the public key can verify any grade, offline, without trusting us.

The boundary, drawn.

Grades live in a trust domain the agent cannot reach: separate account, separate credentials, no network path. Observation crosses the wall in one direction only.

Trust domain A agent

Live infrastructure

The deployed system, observable state

Agent

Writes infrastructure as code

Trust domain B verifier

Verifier

Grades deployed state deterministically

Signed record

Immutable, independently verifiable

FIG. 1 The write-isolation boundary. Grades are computed, signed, and stored where the agent has no path.

Prevention, not detection.

Most defenses against reward hacking are detectors: they audit trajectories and flag manipulation after the fact. But training against a detector teaches evasion. Assay's answer is architectural. The grader is unreachable by construction, so there is no surface to learn to evade.

Detection flags reward hacking. Isolation makes it unreachable.

Five families of real work.

Every family is long-horizon, multi-file, and deterministically gradable. That combination is exactly what the market is short on.

Provisioning: Stand up a system to spec, graded by a clean deploy, integration checks, and policy.
Drift and debug: Reconcile a drifted or broken deployment until the plan is clean and the system is healthy.
Migration and upgrades: Move runtimes, providers, or architectures with behavioral equivalence as the bar.
Incident response: Diagnose and remediate a failing deployment until service health is restored.
Cost and security: Reduce spend or close misconfigurations while preserving function.

How we measure it →Design notes →

The environment is the test.

Work

Deploy

Verify

Record

The boundary, drawn.

Prevention, not detection.

Five families of real work.