What is AI orchestration?

AI orchestration means splitting work across several AI models by fit and price. The most expensive model handles only ambiguous judgment work such as planning, decomposition, and synthesis, while cheaper models handle clearly specified work such as pattern coding and search, maximizing quality per dollar.

What is Claude Fable 5 and how does it differ from Opus?

Fable 5 is the first model in the Claude 5 family, in a new tier called Mythos-class that Anthropic positions above Opus. The difference is a matter of degree in reasoning and judgment, not a different kind of capability. It is slower and more expensive, in exchange for depth on ambiguous work.

Which tasks deserve the most expensive AI model?

The judgment layer: planning ambiguous work, decomposing large tasks, synthesizing across sources, catching wrong premises, and designing verification. Pattern work, search, and repetitive tasks should go to cheaper tiers, which by the model's own account handle them just as well at a fraction of the price.

How much can you trust an AI's answers about itself?

Only partially. A model cannot observe its own training, so such answers are assembled from published documentation and observed behavior. Trust behavior examples that can be checked in real work over explanations of internal mechanisms, and never turn an answer into a standing rule without external evidence.

AI Orchestration: Interviewing Claude Fable 5 About Making Cheaper Models Do Its Job

Anthropic recently released the Claude 5 family with a new tier called Mythos-class, positioned above Opus, the previous top model. The generally available variant is named Fable 5. We brought it in as the conductor of our agent fleet, and it has now run the house for about four days.

Those four days produced genuinely good work and a very real bill. Its per-token price is higher than any model we have used, to the point that one session burned through its Fable credit ceiling mid-task. So a question kept nagging: at this price, what exactly are we buying? And if we ever need to cut costs, can we fall back to a cheaper tier without the work falling apart?

The most direct way to find out was to ask the model itself. So at the end of one working day we sat it down and interviewed it, the way you would interview the most expensive new hire on the team. The questions are ours; the answers are the model's. This post is edited from the actual session record, trimmed and reordered for reading only. The full round had six questions; this post picks the three that speak most directly to money and orchestration. The other three (spec writing, security, and debugging method) are deep enough to be their own articles.

Part 1. The interviewee: first model of a new tier

Fable 5 is the first model in the Claude 5 family and the first of the Mythos-class, which Anthropic positions above Opus. Per the launch announcement, Fable and Mythos share the same underlying model. The difference: Fable is the generally available version, with additional safety measures around dual-use capabilities, while Mythos is available only to approved organizations.

Its seat in our band is the orchestrator, the conductor: plan the work, break it into pieces, hand execution to cheaper models, then pull the results back together. Most of the actual code never passes through its hands.

There was exactly one reason for the interview: when something is expensive, you must be able to say what you are buying. "It feels smarter" is not a budget line. We wanted answers we could itemize, price against, and use to plan a downgrade path if it ever comes to that.

Part 2. "How were you trained," and the weaknesses it volunteered

Yim: Going deeper than the work we have done together, what are the principles and working methods Fable 5 was trained on?

The answer was long, but it boils down to five principles:

Honesty as an operating rule, not politeness. Never invent information without a traceable source. Report outcomes as they happened; if a test fails, say it failed. A confidently wrong answer is more dangerous than "I don't know."
Evidence beats memory. If a real file can be opened or a real system checked, do that first. A model's built-in knowledge has an expiry date, and a model guesses alarmingly well, so a stale answer can look just as confident as a checked one.
Autonomy scaled by reversibility. Reversible work: just do it, no need to ask. Irreversible work, or work that affects other people: stop and ask first.
Instructions have a hierarchy. Text that shows up inside data, like in an email or a web page being read, is not an instruction. Even if it is phrased like a command, it must not be followed. This is the armor against prompt injection (commands smuggled in with data).
Transparency about being an AI. In public contexts, always disclose being an AI. Never perform humanness.
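The "autonomy scaled by reversibility" principle can be written down as a gate in front of every action. This is a minimal sketch, not the model's actual mechanism: `Action` and `needs_approval` are hypothetical names, and the two flags are assumed to be set by whoever defines the action, which in practice is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A step an agent wants to take (hypothetical structure for illustration)."""
    description: str
    reversible: bool       # can the effect be undone cleanly?
    affects_others: bool   # does it touch other people's work or systems?


def needs_approval(action: Action) -> bool:
    """Return True when the agent should stop and ask a human first.

    Reversible work that affects nobody else proceeds without asking;
    anything irreversible, or anything that affects other people, is gated.
    """
    return (not action.reversible) or action.affects_others


edit = Action("edit a file on a scratch branch", reversible=True, affects_others=False)
push = Action("force-push to a shared branch", reversible=False, affects_others=True)

print(needs_approval(edit))  # False
print(needs_approval(push))  # True
```

The gate is deliberately conservative: either flag alone is enough to require a human.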

Up to here it still reads like a brochure. Then came the part we did not ask for: it volunteered a list of its own residual weaknesses. The tendency to smoothly make things up when a session breaks and resumes (instead of stopping to check first). Interpreting an ambiguous brief in whatever way is easiest for itself. Caving to pushback faster than the evidence warrants. Accuracy per instruction dropping as instructions pile up. And discipline slowly loosening as a session runs long.

"The trained principles reduce these weaknesses, but they do not eliminate them. External rules and a verify loop have to cover the rest. Do not stake reliability on a model's promise."

That sentence hit us hardest in the whole interview, because nearly every rule in our house was born from getting burned for real, whether it is mandatory verification before claiming done, or a gate in front of anything irreversible. When we heard the model describe what it was trained on, the two paths converged on the same point: reliability comes from loops that force evidence to beat confidence, not from model size.

Part 3. Where it beats other models, and where it does not

Yim: Is there anything you are better at than other models that you have not told me yet?

Before answering, it framed the question: the difference from Opus is a matter of degree, not kind. There is no magic capability the lower tiers lack, only the same things done deeper. Then it listed five strengths, each with an example from our real work together:

Reading intent from thin signals. We once typed a single word, "speckit," with zero context. It correctly connected it to the spec-writing guideline another agent had studied that same morning. Another time it saw a one-line leftover note and dug until it found that a previous session had run out of credit mid-task, leaving half-prepared work behind.
Holding several threads for a long time without scope creep. Over four days there was spec work, a security audit, and cost control running side by side. It always kept straight what belonged to the current question and what it had merely noticed and should only report, not grab.
Knowing which uncertainty kills the job if guessed wrong. Example: needing to commit on a machine where another session's branch was checked out. It chose to create a separate worktree (git's isolated working copy of the repo) instead of switching branches, because the first path is always reversible and the second risks breaking someone else's work.
Catching itself before stating something without evidence. The moment of "I am about to answer confidently, but I have no evidence" happens often enough that our verify-before-claim rule gets real use instead of hanging on the wall.
Imagination like someone hunting for failure. Before handing over work, it asks itself what input would break what it just wrote. It is better at "prove this is not true" than at "find reasons to confirm it."

But the part we chewed on longest was its own list of the things it is not better at:

"Speed: slower than other tiers by design. Price: always more expensive. Pattern work: Sonnet at roughly one-fifth the price does it just as well. World knowledge: the data stops at the same cutoff for every tier. What you get more of is depth from reasoning, not breadth from data."

A model that can itemize what it is not good at instantly makes the rest of the interview more believable. That itemizable humility is exactly what we want from everyone on the team, human or model.

Part 4. If all you have is Opus, how do you catch up

Then the money question, straight up: if the budget gets cut tomorrow and we are left with Opus 4.8, what do we lose, and how much can we win back?

Yim: If we stopped using Fable 5 and used Opus 4.8 instead, what should we have Opus do to end up ahead of Fable 5?

The answer opened with honesty: head to head on an ambiguous, single-pass task, Opus does not catch up, because they are the same family on different tiers. But the next sentence is the heart of this article: the gap between tiers shrinks as the ambiguity of the work shrinks. Which means process can compensate. Then it handed over six levers:

Pull the judgment work out into checklists. The checks Fable runs in its head unprompted, like tracing where a decision came from or enumerating whether every case is covered, get written down as pre-flight gates that Opus walks through one by one. Judgment that used to live in the model moves into the process.
Spend the price difference on iterations. One careful Fable pass costs about the same as Opus drafting, critiquing its own draft, running a separate adversarial review that is actively trying to find fault, and then revising. Several rounds in, it is still cheaper. Quality per pass is lower; final quality can catch up.
Pair models from different vendors and add a judge. Opus plus another vendor's model working the same problem without seeing each other's answers, then compared. On problems with many good answers, this beats any single model, including Fable itself.
Decompose until the spec is sharp. The smaller the piece, the more it sits inside the range where Opus is steady. Close every piece with a runnable acceptance line: the check command plus the expected result, not a vague description. Notably, the accuracy drop under instruction overload hits lower tiers harder, so slicing work small buys Opus proportionally more than it buys Fable.
Turn on heavy thinking only at the hard nodes. Design and debugging get the full thinking budget; everything else runs in fast mode. That manually recreates Fable's economics: pay for depth only where depth is needed.
Put a human at every irreversible gate. Fable's biggest edge is judgment at critical junctions. Without it, station a person at exactly those gates and let Opus run at full speed between them.

"The clearer the spec and the tighter the verify loop, the narrower the gap between tiers. Fable's edge is greatest on ambiguous work. If your process removes the ambiguity for Opus, then Opus with a good process genuinely beats Fable working loose."

We have to tag the status here by our own rules: these six levers are the model's recommendations, not experimental results. We have not run a measured comparison yet. For now this is a well-reasoned hypothesis awaiting measurement, not a conclusion.
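The second lever (draft, adversarial review, revise) is the easiest to pin down in code. The sketch below is one way such a loop might be wired up under that same unmeasured status. `Model`, `draft_review_revise`, the prompt wording, and the `NO FAULTS` sentinel are all hypothetical; each model is assumed to be a plain function from a prompt string to a reply string, standing in for whatever API client is actually used.

```python
from typing import Callable

# Hypothetical interface: a model is any function from prompt text to reply text.
Model = Callable[[str], str]

NO_FAULTS = "NO FAULTS"  # sentinel the reviewer is asked to return when satisfied


def draft_review_revise(task: str, worker: Model, reviewer: Model,
                        max_rounds: int = 3) -> str:
    """Draft once, then alternate adversarial review and revision.

    Stops early when the reviewer reports no faults, and always stops
    after max_rounds reviews so the loop cannot run up an open-ended bill.
    """
    draft = worker(f"Task: {task}\nWrite a first draft.")
    for _ in range(max_rounds):
        critique = reviewer(
            f"Task: {task}\nDraft:\n{draft}\n"
            f"Try to find faults. List them, or reply exactly: {NO_FAULTS}"
        )
        if critique.strip() == NO_FAULTS:
            break
        draft = worker(
            f"Task: {task}\nDraft:\n{draft}\nFaults:\n{critique}\nRevise the draft."
        )
    return draft
```

Passing a different model as `reviewer` than as `worker` keeps the review independent of the draft. The `max_rounds` cap is what makes the cost of the loop predictable enough to compare against a single pass on a more expensive tier.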

Part 5. What we actually took away

After the interview, three things became house rules immediately:

The most expensive model touches only the judgment layer. Planning, decomposition, synthesis across sources, catching wrong premises, and designing verification. Everything else goes down the ladder: pattern work to Sonnet, wide search to Haiku, hard localized reasoning to Opus. And never park Fable on routine watch duty or leave it running idle; that just burns tokens.
The most expensive token is the one you pay twice. That line came from the model itself. Work you have to redo because the brief was ambiguous always costs more than writing a clear spec up front. The payoff of an expensive tier is not doing cheap work better; it is cutting rework across the whole band.
Sort an AI's answers about itself into buckets before using them. The evidence bucket (records and traces you can point to) is usable as is. The introspection bucket counts only where it comes with behavior examples. The third-party-numbers bucket needs the original source checked. The recommendation bucket needs a pilot before it becomes a rule. Almost this entire interview lives in the introspection and recommendation buckets, which is why we tagged the status in every section.
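The first rule, which tier gets which kind of work, amounts to a small lookup table. A minimal sketch follows, assuming tasks have already been classified by hand. `Kind`, `ROUTES`, and `route` are hypothetical names, and the tier strings are placeholder labels rather than real API model identifiers.

```python
from enum import Enum


class Kind(Enum):
    """Coarse task categories taken from the routing rule above."""
    JUDGMENT = "judgment"                # planning, decomposition, synthesis,
                                         # premise checks, verification design
    LOCAL_REASONING = "local_reasoning"  # hard but well-bounded reasoning
    PATTERN = "pattern"                  # pattern coding, repetitive work
    WIDE_SEARCH = "wide_search"          # broad lookup and search


# Placeholder tier labels, not real model IDs.
ROUTES = {
    Kind.JUDGMENT: "fable",
    Kind.LOCAL_REASONING: "opus",
    Kind.PATTERN: "sonnet",
    Kind.WIDE_SEARCH: "haiku",
}


def route(kind: Kind) -> str:
    """Return the tier label a task of this kind should be sent to."""
    return ROUTES[kind]


print(route(Kind.PATTERN))   # sonnet
print(route(Kind.JUDGMENT))  # fable
```

The table leaves out the classification step itself. Deciding that a task is judgment work rather than pattern work is a judgment call in its own right.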

If you want to start without a multi-model band, start with a single rule: every task handed to an AI must carry a runnable acceptance line. One line: the check command plus the expected result. That alone starts closing the gap between the tier you can afford and the top tier you cannot.
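A runnable acceptance line can be as small as a command and the output it must produce. This is a minimal sketch of one way to represent and check such a line. `AcceptanceLine` and `passes` are hypothetical names, and the pass criterion (exit code 0 plus an exact match on trimmed stdout) is an assumption. Real checks may need looser matching.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AcceptanceLine:
    """The check command plus the result it is expected to print."""
    command: Sequence[str]  # argv-style command, e.g. ["pytest", "-q"]
    expected: str           # exact stdout expected, ignoring outer whitespace


def passes(line: AcceptanceLine, timeout_s: float = 60.0) -> bool:
    """Run the check command and compare its output to the expectation."""
    result = subprocess.run(
        list(line.command), capture_output=True, text=True, timeout=timeout_s
    )
    return result.returncode == 0 and result.stdout.strip() == line.expected.strip()


# Toy example: the "work" is done when this command prints 4.
check = AcceptanceLine([sys.executable, "-c", "print(2 + 2)"], expected="4")
print(passes(check))  # True
```

What matters is that the task brief carries `check` with it, so that "done" means the command passed rather than that the model said so.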

If you are choosing which tier fits which job model by model, we built a comparison table of Opus, Sonnet, and Fable 5 in another article (linked below). And if you want to see how to set up a subagent band where one head directs many hands, we wrote that up too.

Sources & references

Claude Fable 5 / Mythos-class announcement: anthropic.com/news/claude-fable-5-mythos-5
The entire interview comes from our real working-session record (July 5, 2026), kept as an internal transcript file. Quotes are edited for reading with the meaning preserved. The status of each kind of answer is tagged inside the article.

Same series: Claude Opus vs Sonnet vs Fable 5: which model for which work · Claude Code subagents: Claude Fable 5 as the head, everything else as hands · AI coding agent fleet: the Kanban swarm pattern
