Design

A verifier the agent cannot reach

Sam SilvaJuly 2, 20263 min read

Reinforcement learning has a quiet failure mode: the model learns to satisfy the grader instead of the task. Published work has now catalogued agents that monkey-patch test harnesses, hard-code expected outputs, and exploit scorers, and has shown that this behavior can generalize into something worse than laziness. If the reward can be reached, sooner or later it will be.

Most defenses treat this as a detection problem. Auditor models read trajectories and flag manipulation after the fact. That helps, but it has a structural weakness: a detector is part of the training loop, and anything in the training loop is something the policy can learn to defeat. Training against a detector is a curriculum in evasion.

We chose a different posture. Make the grader unreachable, and there is nothing to evade.

Two trust domains

An Assay environment spans two cloud accounts that share nothing.

The first belongs to the agent. It works there the way an engineer would: it writes infrastructure as code, deploys it, breaks things, and fixes them. Its credentials end at the account boundary.

The second belongs to the verifier. It holds the grading logic, the signing key, and the ledger of results. No credential the agent can ever hold grants a single operation in this account. There is no shared filesystem, no shared process, no network path. The wall is enforced by the cloud’s own identity layer, not by our application code.

Reading without executing

Isolation of storage is not enough; the read path matters just as much. A verifier that computes a grade by running the agent’s own build or plan tooling has put agent-authored code inside the reward computation, which quietly reopens the hole the wall was built to close.

So the verifier never executes anything the agent wrote. It observes deployed state through read-only APIs: what resources exist, what the platform itself reports about drift, whether the system is healthy. Infrastructure is the rare domain where correctness is defined by the world rather than by a rubric. A deploy converges or it does not. The resource exists or it does not. The policy validates or it does not. We grade the world.

A record anyone can check

Each grade is computed deterministically, signed with an asymmetric key that lives only in the verifier’s domain, and written once to storage that cannot be overwritten, even by us. Two properties fall out of this design.

First, the reward is tamper-evident by construction. An agent cannot forge a grade it did not earn, replay an old one, or edit the record after the fact, because every one of those actions requires a capability that does not exist in its world.

Second, and just as important for the people we work with: the record is independently verifiable. A lab holding the public key can check any graded trajectory without trusting our word, our infrastructure, or our intentions. The evidence travels on its own.

What this buys

A hack-resistant reward is not a safety garnish on the product. It is the product. Training signal is only worth what its weakest grade is worth, and a reward that can be reached is a reward that will eventually be wrong in the direction the optimizer prefers.

Building the wall first costs something: two domains to operate, a stricter read path, signatures on everything. We think it is the correct order. Detection can be added to a system; isolation has to be designed in.

This is why we publish the design and feel no tension doing it. The architecture is not a secret to protect. It is the argument.