Codex as a Headless Code Reviewer for Claude Code

We were closing a security hole. It let you run a root-level shell through Discord, which is too dangerous to leave open. The fix was straightforward: restrict that worker's permissions down to read-only, no running commands, no editing files. Wrote it, read it back once, looked fine.

But when we handed the same code to a second model to review, it caught in seconds that in the permission list we'd just written, the shell command showed up on both the "allow" side and the "deny" side at once. The hole we thought was sealed was actually still half-open, because two config lines contradicted each other. Deleting the conflicting line was the fix, and only then was the hole really closed.

The scary part is we wrote that fix ourselves, read it back, and still didn't see it. Our head had already decided "this is the code that closes the hole," so our eyes slid right past the line that contradicted it. What caught it wasn't us getting more careful. It was a second model that didn't carry that assumption.

Part 1Why authors can't review their own work

When AI writes code for you, it doesn't just type it out. It holds a reason for every line, why it wrote it that way. Ask it to go back and read its own work, and that same set of reasons is still there, so it reads right past the wrong parts. From its point of view, every line already has an explanation.

We've written before about why AI gets things confidently wrong. The model isn't trying to lie. It's just good at finding reasons to back up what it already did. Point that skill at reviewing its own work and it becomes the lawyer for its own code, not the reviewer.

The fix isn't telling the same model to "review more carefully." The problem isn't carefulness, it's the assumption it's carrying. What works better is handing the code to another model, one that didn't write it and isn't carrying the original reasons, so it sees what the author missed.

This is the same principle human teams have used for ages: the person who writes and the person who reviews shouldn't be the same person. Move that into work where AI writes most of the code and it still holds, maybe more than before, because AI writes faster and more confidently than people do.

Part 2The second catch: a deploy that would ship the future early

The other time the reviewer saved us was on the script that controls deploys for this blog. The system is built to publish one post a day, off a queue. We wrote the part that copies the whole folder up to the server before deploying, and figured that was that.

When we ran the diff past Codex, it pointed at something we hadn't seen at all. If the queue holds several posts that are approved but not yet due, copying the whole folder and deploying once would push every future post live at the same time. The one-a-day cadence we'd designed would break instantly.

In hindsight it's obvious. But while writing it, we were focused on "make today's post deploy," and never thought about the other files riding along in the same folder. The reviewer marked it P2. We fixed it by holding back the not-yet-due posts before the deploy, then putting them back once it finished.

The thing to notice: these two bugs are completely different. One is security, the other is deploy logic. But they share one trait. The code ran clean, no errors. And without a second pair of eyes, both would have shipped to production.

Part 3The second model isn't always right

By now this sounds like a second model is the answer to everything. It isn't. There was another round where the reviewer raised something as a P2 in our message routing. We read it, checked it against the existing tests, and the real path wasn't broken the way it warned. What it caught was a very narrow case, and arguably correct as intended anyway. In the end we didn't change anything.

This is the part that matters. The second reviewer isn't the boss, it's a second opinion. Its job is to flag where you should stop and look, not to decide for you. Everything it raises, you still weigh against the real code and the real tests yourself. Taking it all on faith is about as dangerous as not reviewing at all.

So we treat what the reviewer flags as a list to go verify, not a list to go fix. Only the ones that survive verification get changed. Two of the three times here, it found something real; the third, it over-warned. That's a usable rate, as long as a human is still the one confirming.

Part 4Put a second model on every diff before merge

The principle from all three is one sentence: before every merge, hand the diff to a model that didn't write it. The rest is how to make that happen every time, not just when you remember.

What we set up is only a few things. First, it runs headless: one command, done. The reviewer reads the repo, reads the diff, runs the tests itself, and hands back findings point by point with a severity. Because if you have to open a chat and talk to it every time, eventually you skip it.

Second, the review never downgrades the model to save money. For other tasks we pick the model by difficulty. But for review we force the strongest model with the highest reasoning, always. The whole point of this step is to catch what the author missed; letting the reviewer miss it too is pointless.

The second model we use is OpenAI's Codex, because it ships a review-the-diff command already, and more importantly because it comes from a different vendor than the one that wrote the code. A different vendor means a different set of blind spots, which is the entire reason we put the two against each other. How we wrap it into our own script, where we pin the model and config, that's the part we're turning into our own tooling. But the principle and the shape are these.

If you want to try it, start with one spot. Next time, before you merge work the AI wrote for you, don't take the author's word for it. Open a second model, hand it the diff, and ask one question: what would the author have missed? That alone is enough to see why the writer and the reviewer shouldn't be the same one.

Sources & references

Both cases in this piece (the security hole with the shell command on both the allow and deny side, June 2026, and the deploy bug that would publish queued posts early, June 23, 2026) come from real work on our own fleet, caught by running codex review on real diffs, with fixes committed in both cases. Nothing is hypothetical.
The headless codex review command (reads the repo + diff + runs the tests itself) is something we tested ourselves with codex-cli (June 2026) on both Mac and our server.

The AI That Writes Code
Can't See Its Own Bugs

Part 1Why authors can't review their own work

Part 2The second catch: a deploy that would ship the future early

Part 3The second model isn't always right

Part 4Put a second model on every diff before merge

Get new posts and free resources first

Join the conversation