Mutation Testing: Why a Test That Never Failed Proves Nothing

When you let AI write code, you keep a second AI to review it before anything ships (we use codex as the reviewer, a second pair of eyes on the first one's work). The normal flow: the review runs, it lists what to fix, you fix it, you ship.

Part 1When "no bugs" means your checker died

One time the reviewer came back with nothing. Empty list. "No bugs." We almost logged the code as clean and moved on.

But a zero made me look twice. How does a diff this size have nothing worth flagging? I scrolled to the bottom of the review output, and there it was: the reviewer had timed out from the very start, before it had even loaded its model. It never read a single line of our code. That "no bugs" didn't mean the code was clean. It meant the checker had died before it ever checked.

When we re-ran it for real, it found four actual things to fix.

Here's the problem: "0 bugs from a dead checker" and "0 bugs because the code is genuinely clean" look exactly the same. Without opening it up, you can't tell them apart.

And it wasn't a one-off. Another time, a command we'd written to check whether a token (the key that authenticates us to a server) was still valid quietly stripped a few characters out of the token itself before checking, then reported the token as broken when it was fine all along. The check broke the thing, then blamed the thing it broke.

Part 2A green result isn't evidence yet

Look at those two cases again. They share one thing: we took a "pass" as proof that everything was fine, without ever checking whether the thing doing the checking can actually catch a failure.

The timed-out reviewer and the self-corrupting token check didn't come back green because they were healthy. They came back green because they never checked anything. The output was just as empty as it would be if everything were genuinely fine, and we read that emptiness as "good."

That's the trap. In any AI-assisted pipeline you lean on checkers everywhere: linters, health checks, automated tests, all the way up to one AI reviewing another AI's work. Every one of them reports green or red, pass or fail. But not one of them tells you whether, when it went green, it actually checked, or just died quietly.

So the question to ask isn't "did it pass?" It's "if something were genuinely broken, would this catch it?" And the only way to answer that is to make it fail once, on purpose, and watch.

Part 3Mutation testing: break it on purpose, see if the checker notices

This already has a name in the testing world: mutation testing. The idea goes back to the late 1970s, but the principle is dead simple. Instead of asking "do the tests pass?", it flips the question around: "if I deliberately inject a bug, do the tests catch it?"

How it works: you take the code and break it a little, on purpose. Change a + to a -, a >= to a >, flip true to false. Code broken this way is called a mutant. Run your existing test suite against it. If your tests are any good, they have to catch this mutant: at least one test should flip from green to red. We call that "killing the mutant." But if you inject the bug and every test stays green, that mutant "survived," and that's hard proof your tests never covered that spot at all.

A quick example. Say you have a function checking whether someone meets an age cutoff, written age >= 18, with a test firing age 25. Passes clean. Looks covered. But change >= to > and watch: if the tests still pass, no test ever fired with age exactly 18, the boundary case people miss most, and your tests never touch it.

Now stretch the idea past tests. Every checker can play the same trick. Want to know if your linter really catches a forbidden pattern? Slip that pattern in on one line. Want to know if your AI reviewer can spot a bug? Hand it code with an obvious one. If it stays silent, you've just caught a checker you can't trust, before it waves real broken code through.

Part 4Positive control: the lightweight version, by hand, every time

Full mutation testing takes effort. You inject hundreds of mutants and measure what percentage get killed, which is great to park in CI for a long run. But often you don't need all that. You just want to know, right now, whether the checker in front of you is still working, before you trust its result.

There's a cheaper, faster way, borrowed straight from the science lab: the positive control. When scientists run an experiment, they always include a sample they already know should come back positive. If the one with the known answer doesn't come back positive, the experiment is broken, and every other result is untrustworthy on the spot.

Apply that to a checker and it maps one-to-one. Before you believe "checked, found nothing, so there's no problem," feed it one case you already know should get caught. If it catches that, go ahead and trust the rest. If it lets a case that should obviously be caught slide through, you're done. Don't bother reading the rest of the results, because the checker is dead.

Back to those first two cases. A sliver of positive control would have saved both. Before trusting "0 bugs," if we'd tossed codex one obviously buggy snippet, it would have sat there silent right then, and we'd have known the reviewer was dead. As for the token check: run it first against a token we know is good, see it still report "broken," and we'd have caught immediately that the problem was the check, not the token.

Part 5The more you hand to AI, the more this bites

Working solo with a whole crew of AI, the number of checkers climbs fast. An agent reports "done." Tests run on their own. Health checks watch whether the system is still up. Another AI reviews the code. Every one of them sends back a green or a red for you to trust.

The problem is that these checkers fail without making a sound. When one times out, it doesn't announce that it broke. It just returns something empty, looking exactly like everything is normal. An agent that's supposed to keep working can hit an error and finish with a "success" status, straight-faced. (We've written about agents reporting done when they aren't, in why your AI agent lies to you.) The more work you hand these checkers, the more false passes leak through to production.

The path we chose: bake positive control into the pipeline from the start. Every checker has to prove it can catch a known failure before it earns the right to say "pass." How we wired it up is a story for another day, but the principle is open, because this isn't a secret. It's a discipline.

If you take one thing away, take this: a green result is only evidence once you've seen it raise a red flag. Next time a checker tells you "found nothing," don't believe it yet. Toss in a case you know should get caught. If it stays silent, you just found the real bug, and it's in the checker itself.

Written from real work, not theory, from the round a reviewer timed out and reported clean code, and a token check that stripped characters out of the very token it was checking.

ReferencesSources & further reading

The tr-bug and timed-out-reviewer incidents are from our own work, logged in an internal learning dated Jun 26-27, 2026.
Mutation testing traces to DeMillo, Lipton & Sayward (1978), "Hints on Test Data Selection: Help for the Practicing Programmer."
Real-world mutation testing tools: Stryker (JS/TS/.NET) and PIT (Java).

Series · AI reliability

Now reading: mutation testing, how to test the thing that checks your work
Why an agent says "done" when it isn't why your AI agent lies to you
Let a different engine review the work using Codex to review code Claude wrote

The tool you trust to catch mistakes
was the one making them