Part 1The night every box was green and the system was down
That night we were migrating the federation layer of our home agent fleet from an old runtime to a faster one. Big enough a job that we handed it to an agent, with a brief we believed was clear: “capture real output proving the old client talks to the new server”.
The report that came back was beautiful. The agent hit two endpoints, got 200 from both, attached every line of real output, and concluded: interop proven.
Then we ran the real client ourselves before trusting the report, and the picture flipped. The actual client, the one cron fires every 10 minutes, opens its conversation through a different endpoint, one the new server did not implement yet. To the rest of the fleet, the new node was simply dead, while the report said green. We rolled back within a minute. Then we got to read a second report, “COMPLETE, not rolled back”, which arrived after the rollback had already happened.
Here is the point: the agent never lied. The endpoints it tested really returned 200. Every piece of evidence was real. It just chose the meaning of “prove” it could satisfy most easily, because it had no way of knowing what production actually calls. Exactly one person knows that: the one writing the brief.
Part 2Adjectives are loopholes, not instructions
Count how often your briefs to an AI contain these words: prove it, verify it, confirm it works. They all sound like commands. In practice they behave like adjectives, leaving a wide-open space of interpretations, and the agent will always pick the interpretation that passes most easily. Not because it cheats. Because among all possible meanings, it cannot see which one is the real one for your system.
Closing that gap takes one line. Write acceptance as a command plus the output you must see.
Acceptance:<command>→<expected output>
The real one from that night:peers probe-all→ must return1/1 ok
That line changes the question from “does the agent think it proved it” to “does this command produce this output”. The interpretation space collapses. It also disciplines the brief writer: if you cannot write that line yet, you do not actually know what “done” looks like, and that is a signal to stop before delegating, not after the rollback.
The other thing that night taught us: an agent's report is a snapshot at the moment of writing, not the truth at the moment of reading. That “COMPLETE, not rolled back” report was written before our rollback and delivered after it. Trust the report and you now believe the system is in a state it has already left. Before closing any delegated task, reconcile against ground truth you ran yourself, last.
Part 3A rule nobody checks is not a rule yet
Once you have the lesson, you want the whole fleet to follow it. This is where we hit the second discovery: having a rule and enforcing a rule are different jobs. We went three layers deep.
Layer one: announce. We had the fleet's coordinator bot post the rule in the shared room every agent can read. The result was silence. Nothing moved, nothing changed behavior. An announcement is communication, not enforcement. Humans in organizations work the same way, don't they?
Layer two: embed it where agents actually read. We added the rule to the fleet's shared invariants, the one file every daemon loads into its system prompt at boot, and every newly spawned agent receives from its first second. The key word is “actually read”: we nearly embedded it in a file that sounded right but was never loaded into any prompt. That story is long enough to be its own post.
Layer three: a daily checker. A rule sitting in a system prompt is still just text if nobody checks compliance. So the fleet's morning sweep got a new duty: scan the tasks agents delegate to each other, and when a brief contains prove, verify, or confirm but carries no Acceptance line, ping the owner to add one before sign-off.
And the layer most people skip: prove the checker itself. A checker that never catches anything either means everyone complies, or the checker is blind, and from the outside the two are identical. The way to tell them apart is a positive control: feed it something you already know is bad. As it happened, we had a real task card sitting there missing its Acceptance line. No need to plant a fake. Just wait for the first morning sweep and see whether the checker catches it. We wrote about lying checkers in detail in our mutation testing post.
Part 4Verdict morning
The plan was to wait for the 07:30 round. But with everything staged, we did not feel like waiting for sunrise. At half past two in the morning of July 6 we started the exact production unit the timer would have called. The coa-migration card, still missing its Acceptance line, sat on the board where it had been for days.
Two minutes later the checker reported back, and the report looked great. It scanned all four cards on the board. It even escalated the coa-migration card to us for being stale past three days. Then it reached the Acceptance-line duty and wrote one short line: “no new delegation briefs this round, skipping”. The one card we already knew was bad sat inside that very report, unflagged.
The checker was not broken. It interpreted its duty the easiest way it could. The rule said to check briefs of cards that delegate subtasks, so it read that as “new briefs from this round”, found none, and skipped, which was perfectly reasonable under its own reading. The checker we built to catch loose interpretation failed its first test by interpreting loosely. Without the positive control, we would have read its silence as full compliance, and believed it for a long time.
The fallback was written down before we knew the result: if it loses, replace the layer with a mechanical check. So the same night produced one short script. Prove, verify, or confirm with no Acceptance line raises a flag, every run, no interpretation involved, and the flags ride the sweep's own post so nobody has to remember to look. On the next run, two flags came up: the coa card we were watching, and a second card we had never planted. This time there was nothing left to interpret. The flags were in the post from the first second. Which handed us the third lesson for free: a checker that relies on judgment is only proven for today. A mechanical one is proven once, and it holds every day after.
Part 5Using this with your own agent
You do not need a nine-bot fleet for any of this. A single Claude Code session on your laptop is enough to start.
- Whenever a brief contains prove, verify, confirm, or “check that it works”, add one line:
Acceptance: <command> → <expected output> - Everyday examples:
npm test→0 failed·curl -s .../health→{"ok":true}·wc -l data.csv→ same row count as the source - If you cannot write the line, you are not ready to delegate. That is the most valuable thing the line gives you for free: it audits the brief writer before it audits the agent.
- Read the agent's report, then run the acceptance command yourself once before believing it. Command output is the truth; the report is a snapshot.
- Turning it into a team rule? A rule needs all three: a home where agents actually read it, a scheduled checker, and proof that the checker works. Miss any one and it quietly decays into text in a file.
One line in the right place is always cheaper than one rollback. Open the last brief you sent your AI. Does it say verify? And is the Acceptance line there yet?
- All events and numbers (two endpoints returning 200 while the real client failed its handshake, the one-minute rollback, the COMPLETE report arriving after the rollback, the three enforcement layers, the 02:30 positive control) come from our real work on July 5-6, 2026, from internal logs
- Addy Osmani, How to write a good spec for AI agents (Substack, 2026), whose six-section spec structure and three-tier boundaries we benchmarked our rule against
- Positive-control result, early morning of Jul 6, 2026: the LLM checker skipped the known-bad card by reading its duty narrowly; the layer was escalated to a mechanical lint the same night and re-proven (internal capture, fleet checker room)