Skip to content
Assay

The environment is the test.

The agent does the work. The deployed system grades it. Four steps, one boundary.

  1. 01

    Work

    An agent takes on a real infrastructure task: provision a system to spec, migrate a runtime, repair a broken deployment. It writes infrastructure as code, CDK or Terraform, with the same tools an engineer uses, inside an isolated cloud account of its own.

  2. 02

    Deploy

    The work is actually deployed. There is no simulation layer to charm and no rubric to argue with. The environment either comes up healthy or it does not.

  3. 03

    Verify

    A verifier in a separate trust domain inspects what was deployed, through read-only APIs. It never executes the agent's code or tooling in the grading path, so there is no agent-authored code between the work and the grade.

  4. 04

    Record

    The grade is computed deterministically, signed with a key the agent never holds, and written once to an immutable ledger. Anyone with the public key can verify any grade, offline, without trusting us.

The boundary, drawn.

Grades live in a trust domain the agent cannot reach: separate account, separate credentials, no network path. Observation crosses the wall in one direction only.

Trust domain A agent

Live infrastructure

The deployed system, observable state

Agent

Writes infrastructure as code

Trust domain B verifier

Verifier

Grades deployed state deterministically

Signed record

Immutable, independently verifiable

FIG. 1 The write-isolation boundary. Grades are computed, signed, and stored where the agent has no path.

Prevention, not detection.

Most defenses against reward hacking are detectors: they audit trajectories and flag manipulation after the fact. But training against a detector teaches evasion. Assay's answer is architectural. The grader is unreachable by construction, so there is no surface to learn to evade.

Detection flags reward hacking. Isolation makes it unreachable.

Five families of real work.

Every family is long-horizon, multi-file, and deterministically gradable. That combination is exactly what the market is short on.

Provisioning
Stand up a system to spec, graded by a clean deploy, integration checks, and policy.
Drift and debug
Reconcile a drifted or broken deployment until the plan is clean and the system is healthy.
Migration and upgrades
Move runtimes, providers, or architectures with behavioral equivalence as the bar.
Incident response
Diagnose and remediate a failing deployment until service health is restored.
Cost and security
Reduce spend or close misconfigurations while preserving function.