We were building a skill to help prepare financial statements for online filing with Thailand's business registrar (DBD e-Filing). The rough flow: pull a trial balance out of the accounting software, map each account into the registrar's prescribed line items, then assemble a balance sheet and a profit-and-loss statement ready to submit.
When I first sketched it, instinct said to hand the whole trial balance to the LLM and let it return the statement totals directly. The model can read numbers, it can add them up, and it produces statements that look right. It looked like one clean step. But after sitting with it a moment longer, I saw this was the spot where I almost gave the LLM a job it should never have.
In case the word "skill" is new here, the short version. A prompt is an instruction you type to the model for a single task, and it ends there. A skill (here, a Claude Code skill) is a capability packaged as files: instructions, supporting docs, and crucially, runnable scripts it can ship alongside. The model loads it on its own when a task matches. A prompt tells the model what to do once; a skill is a reusable tool, and because it can carry code, the question of which part should be code and which should be the model is exactly what decides whether the skill can be trusted.
That makes this a piece for two readers at once: the person building the skill, and the accountant who has to rely on the numbers these tools produce. The first gets a way to lay it out; the second gets a way to judge how far to trust the tool in front of them.
Part 1: The spot where I almost let the LLM do the math
The thing about financial statements is that they have to balance. Assets must equal liabilities plus equity, down to the last unit. One line that's off and the filing gets rejected; in some cases it means a wrong statement has been handed to a government agency.
An LLM doesn't "calculate" the way a calculator does. What it actually does is predict the most likely next token, whether that's a word or a digit. Hand it a dozen lines to add and it gives you a total that looks completely reasonable, but nothing guarantees it's right. It might drop a minus sign, round wrong, or skip a line while the output still looks clean. Worse, run the same prompt twice and you can get two different totals.
The most dangerous part is that it's equally confident when it's right and when it's wrong. A statement that balances perfectly and one that's off by three units come out looking identical. Without something checking, you can't tell them apart. (We wrote about AI reporting wrong results in a confident voice in why your AI agent lies to you.)
So I stepped back and asked: in this step, which part is genuinely the model's job, and which part isn't?
Part 2: Every skill has two kinds of work in it
Broken down step by step, the work of preparing a statement is two different things that sit at opposite ends.
The first kind is work that has to produce the same result every time. Same input in, same answer out, no matter how many times you run it. Summing each section, checking whether the statement balances, computing profit and loss, deciding by rule (which year's prescribed format applies, when negative equity crosses into capital deficiency). This work has exactly one right answer, and you can measure whether it's right.
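To make "same input in, same answer out" concrete, here is a minimal sketch of two such steps. The function names are hypothetical, and the capital-deficiency rule (equity below a threshold, defaulting to zero) is an assumption for illustration, not the registrar's definition.

```python
from decimal import Decimal


def net_profit(revenue: Decimal, expenses: Decimal) -> Decimal:
    """Profit for the period; a negative result is a loss."""
    return revenue - expenses


def is_capital_deficiency(total_equity: Decimal,
                          threshold: Decimal = Decimal("0")) -> bool:
    """Assumed rule: equity below the threshold counts as a capital deficiency."""
    return total_equity < threshold


# Decimal keeps the arithmetic exact, so repeated runs give identical results.
print(net_profit(Decimal("1200000.00"), Decimal("1350000.50")))  # -150000.50
print(is_capital_deficiency(Decimal("-50000.00")))               # True
```

Both functions have exactly one right answer for a given input, which is what makes them testable.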
The second kind is work that's open-ended, with no single fixed answer, depending on reading and interpretation. Which line item this account name belongs under. How to phrase the notes so they read clearly. Which P&L layout to choose. What's material enough to disclose. This work needs context and judgment.
The whole dividing line comes down to two questions, asked before you build any step.
One: if you run it again on the same input, must you get exactly the same answer? If yes, it belongs to code.
Two: can you write a test that locks the correct answer in place? If yes, it belongs to code with a golden test guarding it.
Only when a step's answer genuinely depends on reading and interpretation, with no single right answer, does it belong to the LLM. Get it wrong in either direction and it breaks. Put the LLM on work that must be exact and the output is sloppy and non-repeatable. Put code on work that needs interpretation and it's brittle, falling over the moment real-world input doesn't fit the template.
Part 3: The engine that's locked by a test
The path I took: everything that has to be exact gets lifted out into one small script. It takes in a trial balance already mapped to line items, and returns the statement totals plus a verdict on whether it balances. The key property is that it's a pure function, with no network calls, no clock, no randomness. Same input in, same result out, every time. Auditable, reproducible, and something you can write a test against.
The idea is plain. Nothing fancy.
total assets = the sum of the asset-side lines
total liabilities + equity = the sum of the other side
then check that assets == liabilities + equity. If they don't match, something's wrong, so say so out loud.
The line we talked about in the last section shows up right inside the script. The summing, the balance check, the profit-and-loss math, the capital-deficiency condition are all code. The account-to-line-item mapping and the notes are not in here; those stay with the model and a human. The script never guesses. If it hits an account that hasn't been mapped to a line item yet, it doesn't quietly skip it, it raises the alarm that this balance has nowhere to go.
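A minimal sketch of what such an engine could look like follows. It is not the actual script from the skill: the line-item codes, the `Statement` fields, and the `build_statement` name are all hypothetical, and it assumes amounts arrive already signed the way the statement presents them.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical line-item codes; the real DBD formats define their own.
ASSET_ITEMS = {"cash", "receivables", "equipment"}
LIABILITY_ITEMS = {"payables", "loans"}
EQUITY_ITEMS = {"share_capital", "retained_earnings"}


class UnmappedAccountError(ValueError):
    """Raised when a balance has no line item to land in."""


@dataclass(frozen=True)
class Statement:
    total_assets: Decimal
    total_liabilities: Decimal
    total_equity: Decimal
    balanced: bool
    capital_deficiency: bool


def build_statement(rows):
    """rows: iterable of (account_name, line_item_or_None, amount as Decimal)."""
    assets = liabilities = equity = Decimal("0")
    for account, item, amount in rows:
        if item in ASSET_ITEMS:
            assets += amount
        elif item in LIABILITY_ITEMS:
            liabilities += amount
        elif item in EQUITY_ITEMS:
            equity += amount
        else:
            # Never skip quietly: an unmapped balance stops the run.
            raise UnmappedAccountError(f"{account!r} ({amount}) has no line item")
    return Statement(
        total_assets=assets,
        total_liabilities=liabilities,
        total_equity=equity,
        balanced=(assets == liabilities + equity),
        capital_deficiency=(equity < 0),  # assumed rule: negative equity
    )


rows = [
    ("Cash at bank", "cash", Decimal("100000.00")),
    ("Trade payables", "payables", Decimal("130000.00")),
    ("Share capital", "share_capital", Decimal("50000.00")),
    ("Accumulated loss", "retained_earnings", Decimal("-80000.00")),
]
statement = build_statement(rows)
print(statement.balanced)            # True
print(statement.capital_deficiency)  # True
```

Nothing here reads a clock, the network, or a random source, so the same rows always produce the same `Statement`. The mapping in the second column is the part that would come from the model and a human reviewer.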
Splitting it this way isn't a rule we invented. Anthropic says it outright in the announcement for Agent Skills: skills can include executable code "for tasks where traditional programming is more reliable than token generation." That's exactly the same line: there's work code does better, and you should let it.
Part 4: Why the golden test matters more than you'd think
Here's where people slip. They think that once the calculation is lifted into code, it's done, it's trustworthy. But code is trustworthy not because it's code, but because something proves it still computes correctly.
Code drifts. One day you refactor it, flip a sign, change how it rounds, or bump a library version, and the totals shift without anyone noticing. This is where the golden test comes in. We pin the script to a sample company's statement where we know every figure in advance: total assets here, net loss there, equity negative enough to count as a capital deficiency. If the code ever returns a value that doesn't match what's pinned, the test goes red right away, before it ever reaches a user.
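A golden test can be sketched in a few lines. Everything below is illustrative: `summarize` is a hypothetical stand-in for the engine, the figures are made up rather than taken from the real sample company, and treating negative equity as a capital deficiency is an assumption.

```python
from decimal import Decimal


def summarize(lines):
    """Stand-in engine: lines maps a section name to a list of Decimal amounts."""
    totals = {name: sum(amounts, Decimal("0")) for name, amounts in lines.items()}
    return {
        "total_assets": totals["assets"],
        "total_liabilities_and_equity": totals["liabilities"] + totals["equity"],
        "net_profit": totals["revenue"] - totals["expenses"],
        "capital_deficiency": totals["equity"] < 0,  # assumed rule
    }


# Made-up sample company whose figures are all known in advance.
GOLDEN_INPUT = {
    "assets": [Decimal("100000.00"), Decimal("25000.50")],
    "liabilities": [Decimal("175000.50")],
    "equity": [Decimal("50000.00"), Decimal("-100000.00")],
    "revenue": [Decimal("300000.00")],
    "expenses": [Decimal("340000.00"), Decimal("10000.00")],
}

# The pinned answers: any drift in the engine makes the comparison fail.
GOLDEN_EXPECTED = {
    "total_assets": Decimal("125000.50"),
    "total_liabilities_and_equity": Decimal("125000.50"),
    "net_profit": Decimal("-50000.00"),
    "capital_deficiency": True,
}


def test_golden_sample_company():
    assert summarize(GOLDEN_INPUT) == GOLDEN_EXPECTED


if __name__ == "__main__":
    test_golden_sample_company()
    print("golden test passed")
```

Flip a sign in `summarize` or change how it rounds and the equality check fails, which is the whole point: the pinned figures change only when someone deliberately decides they should.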
That test is the line between a demo that runs and a skill you can put into production. A demo just has to run once and it looks fine. A reliable skill has to prove it still computes correctly, every time the code changes. (For more on how you prove your checker can still catch a mistake, we wrote about it in the tool you trust to catch mistakes was the one making them.)
Part 5: This line works for every skill, not just accounting
This started with financial statements, but the same line works for nearly any skill. Every skill has these two kinds of work mixed together. Looping, ordering steps, retrying on failure, deciding by condition: that's work that has to be exact, so it belongs to code. Reading something open-ended, interpreting it, and writing something back: that belongs to the model.
Here's an example with nothing to do with accounting. When we built the system that publishes these articles, we first let the steps that stamp the date and build the table-of-contents card at deploy time run as a half-manual, somewhat-unpredictable process. The result was three bugs: cards vanishing from the index, links throwing a 404 before the post was even meant to be live, and the wrong date stamped. When we moved those steps into a script that runs deterministically at deploy time, all three bugs disappeared at once, because that's work that must produce the same result every time, not work to decide on the fly.
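As a rough illustration of what "deterministic at deploy time" can mean, here is a sketch of an index builder. It is not our publishing script: the post fields, the URL pattern, and the `build_index` name are hypothetical. The one design choice it shows is passing the deploy date in as an argument instead of reading the clock inside the function.

```python
from datetime import date


def build_index(posts, today):
    """Return index cards for posts that are live on `today`, newest first.

    posts: list of dicts with 'slug', 'title', and 'publish_date' (a date).
    """
    live = [p for p in posts if p["publish_date"] <= today]
    # Sort on date, then slug, so same-day posts have a stable order.
    live.sort(key=lambda p: (p["publish_date"], p["slug"]), reverse=True)
    return [
        {
            "href": f"/posts/{p['slug']}/",
            "title": p["title"],
            "date": p["publish_date"].isoformat(),
        }
        for p in live
    ]


posts = [
    {"slug": "engine-vs-judgment", "title": "Engine vs judgment",
     "publish_date": date(2026, 7, 1)},
    {"slug": "future-post", "title": "Not live yet",
     "publish_date": date(2026, 8, 1)},
]
cards = build_index(posts, today=date(2026, 7, 15))
print([card["href"] for card in cards])  # ['/posts/engine-vs-judgment/']
```

Because `today` is an input, the same posts and the same date always give the same cards, and a future-dated post never gets a link before it is meant to be live.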
And if you're an accountant, this line holds even if you never write the code yourself. When you're choosing or vetting an AI tool that prepares statements or does the math for you, the one question worth asking is whether the numbers it returns come from code locked by a test, or from a model guessing. If it's the latter, don't trust the totals yet, no matter how clean the statement looks.
If you take one thing away, take this. Don't give the LLM work you could lock with a test. That work belongs to code. Starting is simple. In your next piece of work, find the step where "run it again, same result" has to hold, lift it out into a small script with one test that pins the correct answer, and let the model handle the rest. What you get is not just a skill that runs, but one you can rely on.
Written from real work, not theory, from building an actual skill that prepares statements for filing with the business registrar, where the calculation engine was lifted into a script plus a golden test that passes.
Sources & references
- The statement engine and its golden test come from our own work, a skill that prepares statements for DBD e-Filing, recorded internally in late June 2026 (golden test passing).
- The line about bundling executable code is from Anthropic, "Introducing Agent Skills", original wording: "Skills can include executable code for tasks where traditional programming is more reliable than token generation."
- The line-item formats and statement layouts follow the public regulations of Thailand's Department of Business Development (DBD).
- This one: separate the deterministic engine from judgment so a skill can be trusted (you're reading it)
- What a skill is, and how it differs from a prompt: a skill isn't just a longer prompt
- Why AI reports done when it isn't: why your AI agent lies to you
- Proving your checker can still catch a mistake: the tool you trust to catch mistakes was the one making them
- Letting a different engine review the work: using Codex to review what Claude wrote