Recently I ported a financial-statement engine from its original codebase into a new language. It takes a chart of accounts and a trial balance and produces the financial statements. Once the new version was written, I added tests to guard it. Every one of them passed.
But something nagged. I had written those tests myself, from the same understanding of the problem I used to write the code. If I had misunderstood something at the start, the tests would be wrong in the same direction, and still pass. A passing test did not prove the new engine worked like the old one. It only proved the new engine matched what I thought it should do.
That distinction matters most when you port code. A port is different from writing something new: you already know what the correct output is, because the original running in production is the answer key. Characterization testing is built for exactly this. This piece walks through it in order: what it is, how to prove a port by running both versions, and a rule that runs against instinct: when you find a bug in the old code, mirror it first and do not fix it yet.
Part 1: What characterization testing is
Characterization testing means writing tests that record what the code does, not what it should do. Instead of starting from the spec, you start from the current behavior, lock it in as a reference, and from then on every change is checked against that reference. Michael Feathers named it in Working Effectively with Legacy Code, for the job of changing old code that has no tests.
The technique goes by several names. Snapshot testing, familiar to front-end developers; approval testing; golden master. Same idea in all of them: freeze one set of outputs and use it to judge the next run. Only the context and tooling differ.
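To make "record what the code does" concrete, here is a minimal Python sketch. The function `legacy_format_amount` and its quirks are hypothetical stand-ins, not taken from the engine described here; the point is only that the reference is produced by running the old code, not by reading a spec.

```python
def legacy_format_amount(value):
    """Hypothetical stand-in for old code whose behavior we want to pin down."""
    # Quirks we did not design and may not like: zero prints as a dash,
    # negatives are wrapped in parentheses.
    if value == 0:
        return "-"
    text = f"{abs(value):,.2f}"
    return f"({text})" if value < 0 else text


def record_behavior(func, inputs):
    """Run func over the inputs and capture whatever it returns."""
    return {repr(x): func(x) for x in inputs}


inputs = [0, 1234.5, -1234.5]

# Step 1: freeze what the code does today. This dict is the reference.
reference = record_behavior(legacy_format_amount, inputs)

# Step 2: after any change, record again and compare against the reference.
assert record_behavior(legacy_format_amount, inputs) == reference
```

Nothing in the sketch says whether the dash for zero is right. It only guarantees that a later change cannot alter it unnoticed.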
Why it fits porting so well
A port differs from new work in one way: the right answer already exists, in the form of the original that runs in production. If the new code matches the original on every case you feed both, you know the port is faithful on those cases, not just that it matches your guess about what it should do.
Back to the problem at the start: self-written tests have a blind spot right here. If the same person writes the code and the tests from the same understanding, a wrong understanding carries into both equally. The tests confirm the code matches your understanding, not that the understanding is correct. Characterization testing steps around that by using the original as the answer key instead of your own understanding.
Part 2: Prove the port by running both, then lock the output as a golden
The strongest proof has three beats.
- Run both. Feed the old and the new the same inputs, and keep both outputs.
- Diff line by line (byte-diff). Not a rough "close enough." Numbers must match in both position and value. If even one line differs, the port isn't right yet, and you go find out why.
- Lock it in as a golden. Once everything matches, save the original's output as a fixed reference file. This kind of file is called a golden. From then on the test no longer needs the old engine around: run the new one, diff against the golden.
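The three beats above can be sketched in Python roughly as follows. Everything named here is an assumption for illustration: `old_engine` and `new_engine` are toy stand-ins for the two implementations, the tab-separated output format is invented, and the golden file name is arbitrary.

```python
import difflib
import tempfile
from decimal import Decimal
from pathlib import Path


def old_engine(trial_balance):
    """Toy stand-in for the original engine (hypothetical)."""
    lines = [f"{name}\t{amount:.2f}" for name, amount in trial_balance]
    total = sum(amount for _, amount in trial_balance)
    lines.append(f"TOTAL\t{total:.2f}")
    return "\n".join(lines) + "\n"


def new_engine(trial_balance):
    """Toy stand-in for the port (hypothetical)."""
    out, total = [], Decimal("0")
    for name, amount in trial_balance:
        out.append(f"{name}\t{amount:.2f}")
        total += amount
    out.append(f"TOTAL\t{total:.2f}")
    return "\n".join(out) + "\n"


def byte_diff(expected: bytes, actual: bytes) -> list[str]:
    """Return an empty list when the bytes are identical, else a line diff."""
    if expected == actual:
        return []
    return list(difflib.unified_diff(
        expected.decode("utf-8").splitlines(keepends=True),
        actual.decode("utf-8").splitlines(keepends=True),
        fromfile="old", tofile="new",
    ))


def lock_golden(path: Path, old_output: bytes, new_output: bytes) -> None:
    """Save the original's output as the golden, but only if the port matches."""
    diff = byte_diff(old_output, new_output)
    if diff:
        raise AssertionError("port differs from original:\n" + "".join(diff))
    path.write_bytes(old_output)


# Beat 1: run both on the same inputs and keep both outputs.
inputs = [("Cash", Decimal("1000.00")), ("Loans", Decimal("-1000.00"))]
old_out = old_engine(inputs).encode("utf-8")
new_out = new_engine(inputs).encode("utf-8")

# Beats 2 and 3: diff, then lock the original's output as the golden.
golden_path = Path(tempfile.mkdtemp()) / "trial_balance.golden"
lock_golden(golden_path, old_out, new_out)
```

Comparing bytes first and only building the line diff on a mismatch keeps the pass/fail decision exact, while the diff stays readable when something does drift.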
From real work: when I ported the statement engine, I saved the original's output as golden files and set up over a hundred checks across the balance sheet, the income statement, and the cash-flow statement. Every time I change the new code, that suite flags it the moment any line drifts from the golden. The engine stays locked to the original's known output without carrying the old code along.
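A check of that kind might look like the sketch below, which needs only the golden file and the new engine. The helper names, the file name, and the sample statement are assumptions for illustration, not the actual suite.

```python
import tempfile
from pathlib import Path


def first_drift(golden: str, actual: str):
    """Return (line_number, golden_line, actual_line) for the first mismatch,
    or None when the two texts are identical."""
    want = golden.splitlines(keepends=True)
    got = actual.splitlines(keepends=True)
    for i in range(max(len(want), len(got))):
        w = want[i] if i < len(want) else None
        g = got[i] if i < len(got) else None
        if w != g:
            return (i + 1, w, g)
    return None


def assert_matches_golden(golden_path: Path, actual: str) -> None:
    """Fail loudly, naming the first line that drifted from the golden."""
    # Read bytes and decode so newline characters are compared as stored.
    golden = golden_path.read_bytes().decode("utf-8")
    drift = first_drift(golden, actual)
    if drift is not None:
        line_no, want, got = drift
        raise AssertionError(
            f"{golden_path.name}: line {line_no} drifted: "
            f"expected {want!r}, got {got!r}"
        )


# Demo setup: a golden saved earlier from the original (contents invented here).
golden_file = Path(tempfile.mkdtemp()) / "income_statement.golden"
golden_file.write_bytes(
    "Revenue\t500.00\nExpenses\t-200.00\nNet income\t300.00\n".encode("utf-8")
)


def new_engine_income_statement() -> str:
    """Hypothetical stand-in for the ported engine's output."""
    return "Revenue\t500.00\nExpenses\t-200.00\nNet income\t300.00\n"


assert_matches_golden(golden_file, new_engine_income_statement())
```

Reporting the first drifted line number, rather than a bare pass or fail, is what makes a hundred-plus checks cheap to act on.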
Don't trust the mock you built
One thing people miss when writing tests: don't trust the mock, the fake data you build to test against. Anchor to the real shape of the data from the source. In this case the mock used one set of field names, but the real data model the engine returns used another. Trust the mock and the test passes, then the code breaks against the real thing. Short version: mocks lie, the real data model doesn't.
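One cheap guard is to compare the mock's shape against a record captured from the real source before relying on it. The sketch below assumes records are plain dicts; the field names are invented for illustration, since the actual names are not given here.

```python
def shape_mismatch(mock_record: dict, real_record: dict) -> dict:
    """List the field names that appear on only one side."""
    mock_keys, real_keys = set(mock_record), set(real_record)
    return {
        "only_in_mock": sorted(mock_keys - real_keys),
        "only_in_real": sorted(real_keys - mock_keys),
    }


# Hypothetical field names: a hand-built mock versus a captured real record.
mock_row = {"account_code": "1100", "balance": "1000.00"}
real_row = {"code": "1100", "amount": "1000.00"}

mismatch = shape_mismatch(mock_row, real_row)
if mismatch["only_in_mock"] or mismatch["only_in_real"]:
    print("mock does not match the real data model:", mismatch)
```

An empty result on both sides only says the field names agree. It says nothing about types or values, so it complements a golden built from real output rather than replacing it.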
Part 3: Find a bug in the old code? Mirror it first, don't fix it yet
While you diff, characterization testing tends to turn up something you weren't looking for: a bug in the old code. Because you have to compare every line, you end up reading the original's behavior more closely than normal use ever makes you.
In this port, the diff exposed one such spot in the original: it grouped accounts by checking for a word in the group's name. The code wanted only the "current" group (in Thai, หมุนเวียน), but it checked whether the name contained that word. The trouble is that the "non-current" label (ไม่หมุนเวียน) contains "current" as a substring, so the check pulled non-current groups in too: property, plant and equipment, and long-term loans leaked into working capital.
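The mechanism is easy to reproduce. In the Python sketch below, the two Thai labels come from the text above, while the function name and the full group names (an "assets" prefix joined to each label) are assumptions for illustration.

```python
CURRENT = "หมุนเวียน"            # "current"
NON_CURRENT = "ไม่" + CURRENT     # "non-current": the same word with a negating prefix
ASSETS = "สินทรัพย์"              # "assets" (assumed prefix for the group names)


def is_current_group_buggy(group_name: str) -> bool:
    """Substring check, as the original is described: true for any name
    that merely contains the word for "current"."""
    return CURRENT in group_name


groups = [ASSETS + CURRENT, ASSETS + NON_CURRENT]
selected = [g for g in groups if is_current_group_buggy(g)]

# Both groups are selected, so non-current items leak into working capital.
print(len(selected), "of", len(groups), "groups matched")
```

The check is not wrong about what it finds. The word really is inside both names, which is why the bug survives casual testing with a single "current" group.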
The first instinct is to fix it on the spot. But in a port, fixing the old code's bug in the new code is the wrong move. The goal of a port is for the new code to behave exactly like the old one, the correct parts and the wrong parts alike. If you quietly fix the bug in the new code, its output stops matching the golden immediately, and you can no longer tell whether a difference is a port mistake or your deliberate fix.
What we did instead: reproduce the wrong behavior exactly in the new code, and leave a comment marking it as a known bug, then send it back for the owner of the original code to fix at the source. Once the source is fixed, the difference shows up as a clear diff, and you follow it to fix the new code in the right place.
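A mirrored bug might look like this in the new code. The sketch uses English labels as an analog of the Thai ones, and the function names and comment wording are assumptions, not the actual port.

```python
def is_current_group(group_name: str) -> bool:
    # KNOWN BUG, mirrored from the original on purpose:
    # this is a substring check, so "Non-current ..." also matches.
    # Do not fix here. Reported to the owner of the original code;
    # change this only after the source is fixed and the golden is regenerated.
    return "current" in group_name.lower()


def working_capital_groups(group_names):
    """Select the groups the original would have selected, leak included."""
    return [name for name in group_names if is_current_group(name)]


names = ["Current assets", "Non-current assets", "Equity"]
selected = working_capital_groups(names)
print(selected)
```

The comment carries as much weight as the code: it tells the next reader the behavior is deliberate, and it names the condition under which it should change.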
The rule sounds backwards, but it is the heart of an honest port: match first, wrong parts included, then fix later and on purpose. Fix quietly along the way, and you soon cannot tell a port mistake from a deliberate change.
Part 4: Using it in real work
Where it fits
- Porting across languages or frameworks where the new code must match the old output
- Refactoring old code with no tests: before you dare change it, lock in the current behavior
- Swapping a library or an internal engine that should give the same result after the change
- Guarding heavy-computation output, like financial reports, where every line must match exactly
The one rule to remember
If you take one thing from this: when you port something, don't trust the test you just wrote; trust the original that runs. Make its output the answer key, and match it line by line. A test written from your own understanding is worth the least at the exact moment you don't yet know whether that understanding is right.
How to start
- Pick one chunk of code you're about to port or refactor that still has a working original.
- Prepare sample inputs that cover the important cases, run them through the original, save the output as a golden.
- Run the new code on the same inputs, diff against the golden line by line.
- Wherever they differ, work out whether it's a port mistake or a bug in the old code. If it's an old bug, mirror it first with a comment.
- Once everything matches, lock the golden in as a permanent test. From then on, any change to the new code triggers a warning the moment output drifts.
- Written from real work: porting a financial-statement engine into a skill script, guarded by a golden-test suite of a hundred-plus checks, byte-diffed against the original's output.
- The term "characterization test" comes from Michael Feathers, Working Effectively with Legacy Code (2004). The same family includes golden master, approval testing, and snapshot testing.
This is one layer of the full production AI agent architecture (7 layers).