Build a Trustworthy Second Brain from Speech in Obsidian

The previous post ended on this point: checking each claim by hand really does work, but once the clips pile up into the dozens, replaying every line to compare it against the summary stops being feasible. So this time I will walk through how we turned it into a system, so that what we used to do by hand becomes a repeatable pipeline.

Let me restate the old rule first, because everything in this post is about making that rule happen on its own. Nothing the AI summarizes becomes knowledge until it can point back to evidence in the actual audio. The rest is just how to wire a pipeline around that one rule.

The system has four stages: prep the audio so it is ready to check, run the gate itself, store the result in the vault in good order, and finally wire it all to run on its own. I describe each of these as a principle. The tooling that makes it fast and accurate on real work is a black box each team can pick for themselves.

The terms, gathered here in one place

ASR: the system that transcribes speech into text
diarization: who spoke when, telling which sentence belongs to which speaker
chunking: cutting audio or text into segments before processing
coreference: resolving references, so when someone says "that thing" or "that project" you know what it means
claim: each point the AI summarizes, the unit we check one at a time
provenance: the span in the actual audio that a claim can point back to
grounding gate: the step that requires every claim to carry evidence before it can be stored
MOC (Map of Content): an index note that gathers the ways into other notes
HITL (human in the loop): the point where a person steps in to decide instead of letting the machine decide alone

Stage 1: Prep the audio so it is checkable

Why a raw transcript is still not checkable

Speech that has just been transcribed into text is not ready to check for evidence right away, because it is missing three kinds of context that let a claim point back. Skip this and the gate downstream meets nothing but ambiguity it cannot rule on.

Who spoke (diarization): if you do not know who said a sentence, then when the AI summarizes "the team decided X" you cannot trace back who actually said it.
How it is segmented (chunking): you can cut by time, by speaker, or by meaning, and each gives a different result. Cutting by meaning keeps each chunk to a single idea, which is far easier to check than raw by-the-minute cuts.
Dangling references (coreference): when someone says "that thing" or "the project from before," resolve it to a real name first, otherwise the meaning is lost the moment you cut it into chunks.

How to do these three

Separate the speakers: use a speaker diarization model to tag who spoke in each segment, then align it with the transcribed text so every sentence carries its speaker.
Cut by meaning: instead of slicing by the minute, find the seams where the topic shifts. You can have an LLM read it and mark where the subject moves, or use semantic closeness (embeddings) to find the boundary, then cut there.
Resolve dangling references: have an LLM read the whole stretch and replace phrases like "that thing" or "the project from before" with the real names, before you cut it into chunks, so the meaning does not slip.
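The speaker-alignment step can be sketched in a few lines. This is a minimal sketch, assuming the diarization model gives you speaker turns with start and end times and the ASR gives you sentences with timestamps. The `Turn` and `Sentence` types and the `assign_speakers` helper are names I made up for illustration, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization segment: who spoke, and when (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class Sentence:
    """One transcribed sentence with its timestamps (seconds)."""
    text: str
    start: float
    end: float
    speaker: str = "unknown"


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the time overlap between two intervals, 0.0 if none."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(sentences: list[Sentence], turns: list[Turn]) -> list[Sentence]:
    """Tag each sentence with the speaker whose turn overlaps it the most."""
    for s in sentences:
        best = max(
            turns,
            key=lambda t: overlap(s.start, s.end, t.start, t.end),
            default=None,
        )
        # Leave the speaker as "unknown" when no turn overlaps at all.
        if best is not None and overlap(s.start, s.end, best.start, best.end) > 0:
            s.speaker = best.speaker
    return sentences
```

A sentence that overlaps no turn stays "unknown" rather than being guessed, which keeps the ambiguity visible for the gate later on.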

Once all three are done, verify it worked by reading a random chunk: if you can understand it on its own without opening the neighboring chunks, it is prepped well enough to check.
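For the embedding variant of cutting by meaning, here is a rough sketch of the idea. It assumes you already have one embedding vector per sentence from whatever model you picked. The `chunk_by_meaning` function and the `threshold` default of 0.5 are illustrative assumptions, and the threshold is a value to tune on your own material.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors, 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_by_meaning(
    sentences: list[str],
    vectors: list[list[float]],
    threshold: float = 0.5,  # assumed starting point, not a recommended value
) -> list[list[str]]:
    """Start a new chunk wherever two neighboring sentences drift apart."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # Low similarity to the previous sentence is treated as a topic seam.
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return chunks
```

Comparing only neighboring sentences is the simplest possible boundary test. It is enough to try the read-a-random-chunk check above, and you can replace it with something smarter without changing what comes out.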

Stage 2: The gate as a contract

The heart of the system is seeing the gate as a contract, not some mysterious function. The contract says only what goes in and what comes out. How the inside works is the black box's business, and the upside of seeing it this way is that you can swap the engine inside anytime without touching the rest of the pipeline.

The gate's contract: claim plus evidence in, verified claims out. The inside is a black box.

The two things this contract guarantees

Bind each claim to evidence: every claim that passes must come with where its evidence sits in the actual audio. Anything that cannot attach evidence does not pass.
Verify by re-reading, not by trusting a score: confirmation means going back to read that span in the source and checking it says the same thing, not looking at a confidence number and believing it. A pretty number does not mean it is true.
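To make "contract" concrete, here is one way the input and output could look as types. This is a sketch under my own assumptions: the names `Evidence`, `Claim`, `Gate`, `RequireEvidenceGate`, and `run_gate` are all hypothetical, and the example engine enforces only the first guarantee (a claim must carry evidence). The re-reading step still belongs to whatever engine, or person, sits inside the box.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Evidence:
    """Where in the actual audio a claim points back to."""
    source: str   # path or ID of the recording
    start: float  # seconds
    end: float    # seconds
    quote: str    # transcript text for that span


@dataclass(frozen=True)
class Claim:
    text: str
    evidence: Optional[Evidence] = None


class Gate(Protocol):
    """The contract: a claim goes in, a pass/hold decision comes out."""
    def check(self, claim: Claim) -> bool: ...


class RequireEvidenceGate:
    """Simplest possible engine: pass only claims with a usable evidence span."""
    def check(self, claim: Claim) -> bool:
        ev = claim.evidence
        return ev is not None and bool(ev.quote.strip()) and ev.end > ev.start


def run_gate(gate: Gate, claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (passed, held). Held claims are kept, never dropped."""
    passed: list[Claim] = []
    held: list[Claim] = []
    for claim in claims:
        (passed if gate.check(claim) else held).append(claim)
    return passed, held
```

Because `run_gate` only depends on the `check` method, you can swap `RequireEvidenceGate` for a stricter engine without touching the code that calls it.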

The first post in the series left a lesson: "can't find it" is not the same as "fabricated" (remember Edward Thorp). Once you systematize it, the gate works with two hands. The mechanical hand searches the source automatically for where this claim was said, and lifts the window around that spot in front of you. The human hand reads that window and decides whether it really is the same thing. A claim the machine cannot find is unsupported, not fabricated, because the words may have been garbled in transcription, or the idea may have been spoken without naming the exact term. The machine makes it faster, but the human still decides.
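The mechanical hand can be as simple as a word-overlap search. The sketch below is an illustration under stated assumptions, not the engine we actually run: `find_window`, `min_score`, and `radius` are hypothetical names, and plain word overlap is a stand-in for whatever matching your own black box uses. Note the two possible outcomes: a window for a human to read, or "unsupported". There is no "fabricated" outcome.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def find_window(
    claim: str,
    sentences: list[str],
    min_score: float = 0.5,  # assumed cutoff, tune to your material
    radius: int = 1,         # how many neighboring sentences to show
) -> dict:
    """Find the transcript sentence that best matches the claim and lift
    the window around it for a human to read."""
    claim_words = words(claim)
    if not claim_words or not sentences:
        return {"status": "unsupported", "window": []}
    # Score = fraction of the claim's words that appear in each sentence.
    scores = [len(claim_words & words(s)) / len(claim_words) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    if scores[best] < min_score:
        # Not found is "unsupported", never "fabricated".
        return {"status": "unsupported", "window": []}
    lo = max(0, best - radius)
    hi = min(len(sentences), best + radius + 1)
    # The machine only proposes; the human reads the window and decides.
    return {"status": "needs_human_review", "window": sentences[lo:hi]}
```

A match here never means "verified" on its own. It only hands the right stretch of transcript to the person who makes the call.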

What deserves special care is that the AI does not make things up at random; it inserts what "plausibly should be there." That smooth plausibility is exactly where it fails, and it is much harder to catch than an obvious random error. That is why the gate has to point back to evidence on every claim, not just skim and decide it sounds fine.

There is some tolerance built in: if the AI summarizes with words that are not an exact match but mean the same thing, it can still count as a pass (it does not have to be letter-perfect). How wide that tolerance should be to land just right is a value you tune to your own material; there is no fixed number.

Stage 3: Store it in the vault, systematically

Once you have claims that passed the gate, the next step is storing them into Obsidian as knowledge you can actually use, not a pile of notes you cannot find. Three things keep the vault from getting messy.

A note header that always states its origin

Every note carries the same frontmatter, at least four fields: Tags, Date, Source, Status. These fields are basic note-taking hygiene, nothing secret. What matters is Source, which points back to the original audio, and Status, which works as a trust ladder. A note that comes from a tool that already answers with citations is trustworthy from the start. A note synthesized from raw audio starts as "unverified" and is ranked lower until it passes the gate. Once it passes, it moves up to "verified." Open the vault and you see at a glance how far each note can be trusted, instead of everything mixed together.
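Here is a small sketch of how a note with that header could be rendered. The field names follow this post; the `render_note` function, the two status values as a fixed list, and the example file path in the usage below are my own illustrative assumptions, so adapt the key names and casing to whatever your vault already uses.

```python
import json
from datetime import date
from typing import Optional

# The two rungs of the trust ladder named in this post.
ALLOWED_STATUS = ("unverified", "verified")


def render_note(
    title: str,
    body: str,
    tags: list[str],
    source: str,
    status: str = "unverified",
    on: Optional[date] = None,
) -> str:
    """Render a note as text with a YAML frontmatter header on top."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"status must be one of {ALLOWED_STATUS}")
    on = on or date.today()
    header = "\n".join([
        "---",
        f"Tags: [{', '.join(tags)}]",
        f"Date: {on.isoformat()}",
        # json.dumps gives a double-quoted string, which is also valid YAML.
        f"Source: {json.dumps(source)}",
        f"Status: {status}",
        "---",
    ])
    return f"{header}\n\n# {title}\n\n{body}\n"
```

For example, `render_note("Beta launch", "Ship in March.", ["meeting"], "audio/standup.m4a")` produces a note that starts as "unverified". Making "unverified" the default means a note can only reach "verified" when something deliberately sets it.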

Overwrite, append, or make a new note

When something new comes in, you have to decide where it goes. The simple rule: if it is an update to the same topic, append to the existing note. If it is a view that contradicts the old one, do not overwrite it; keep both sides and link them together. If it is genuinely a new topic, make a new note and link it into the related ones.
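The rule above fits in one small function. This sketch assumes something upstream (a person or a model) has already judged how the new material relates to the vault; the `Relation` enum, the `plan_write` function, and the shape of the returned plan are hypothetical names for illustration.

```python
from enum import Enum
from typing import Optional, Sequence


class Relation(Enum):
    """How the incoming material relates to what is already in the vault."""
    UPDATE = "update"
    CONTRADICTION = "contradiction"
    NEW_TOPIC = "new_topic"


def plan_write(
    relation: Relation,
    new_title: str,
    existing_title: Optional[str] = None,
    related: Sequence[str] = (),
) -> dict:
    """Turn the append / keep-both / make-new rule into a write plan."""
    needs_existing = relation in (Relation.UPDATE, Relation.CONTRADICTION)
    if needs_existing and existing_title is None:
        raise ValueError("updates and contradictions need an existing note")
    if relation is Relation.UPDATE:
        # Same topic: append to the note that is already there.
        return {"action": "append", "target": existing_title,
                "link_to": [], "backlink": False}
    if relation is Relation.CONTRADICTION:
        # Never overwrite: keep both notes and link them in both directions.
        return {"action": "create", "target": new_title,
                "link_to": [existing_title], "backlink": True}
    # Genuinely new topic: new note, linked into the related ones.
    return {"action": "create", "target": new_title,
            "link_to": list(related), "backlink": False}
```

Notice that no branch returns an "overwrite" action, so the old view can never be lost by accident.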

An index that updates itself without breaking

Every note gathers at an MOC, a central index note. When a new note arrives, add a way into it from the right MOC by category; do not leave it floating loose. Do this and the vault grows while you can still find things, instead of growing into a mess.
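"Updates itself without breaking" mostly means two things: adding the same link twice changes nothing, and a missing category gets created instead of causing an error. Here is a sketch that treats the MOC as plain text, assuming categories are written as "## " headings with one wiki-link bullet per note. That layout and the `add_to_moc` function are my assumptions; your MOC may be organized differently.

```python
def add_to_moc(moc_text: str, note_title: str, category: str) -> str:
    """Add a [[wiki-link]] to the note under its category heading.
    Calling it again with the same note changes nothing."""
    link = f"- [[{note_title}]]"
    lines = moc_text.splitlines()
    if any(line.strip() == link for line in lines):
        return moc_text  # already indexed
    heading = f"## {category}"
    if heading in lines:
        # Walk to the end of this category's section.
        i = lines.index(heading) + 1
        while i < len(lines) and not lines[i].startswith("## "):
            i += 1
        # Step back over blank lines so the link joins the existing list.
        while i > 0 and lines[i - 1].strip() == "":
            i -= 1
        lines.insert(i, link)
    else:
        # Unknown category: create the section at the end of the MOC.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [heading, link]
    return "\n".join(lines) + "\n"
```

Since the function returns the new text rather than writing a file, you can inspect the result before anything in the vault actually changes.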

Stage 4: Wire it to run on its own

The first three stages can be done by hand. This stage is wiring it to run end to end on its own. The pipeline has this shape.

Pipeline shape: audio → ASR → orchestrator → the gate → Obsidian. The human steps in only when the gate is unsure.

The endpoint connector is the Obsidian Local REST API, which lets you write notes into the vault from outside the app. The most important part of this stage is the principle of HITL (human in the loop), keeping a person in the loop instead of letting the machine decide everything: what the gate is confident about flows through, and what the gate is unsure about should raise a flag and call a human, not be cut silently. Cutting things automatically is losing real material with no one the wiser. Where exactly the confidence line sits before it stops calling a human is a value you tune to the risk of each kind of work.
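A sketch of the last hop, with the HITL rule in front of it. Treat the details as assumptions to check against the plugin's own documentation: the `/vault/{path}` route with a PUT request and a Bearer API key is how I understand the Obsidian Local REST API plugin to work, and the base URL in the usage note is the plugin's usual local HTTPS address as far as I know. The `route` and `build_put_request` functions and the 0.8 threshold are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def route(confidence: float, threshold: float = 0.8) -> str:
    """HITL rule: confident claims flow through, unsure ones call a human.
    There is deliberately no 'drop' outcome."""
    return "store" if confidence >= threshold else "flag_for_human"


def build_put_request(
    base_url: str,    # assumed plugin address, e.g. a local HTTPS URL
    api_key: str,     # the key shown in the plugin's settings
    vault_path: str,  # note path relative to the vault root
    markdown: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that writes one note to the vault."""
    url = f"{base_url.rstrip('/')}/vault/{quote(vault_path)}"
    return urllib.request.Request(
        url,
        data=markdown.encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "text/markdown",
        },
    )
```

Usage would look like `build_put_request("https://127.0.0.1:27124", key, "Meetings/Beta launch.md", note_text)`, called only when `route(...)` returns "store". The sketch stops short of sending, because the plugin serves a self-signed certificate by default and how you trust it is a choice for your own setup.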

This post lays out the full principle, enough to assemble it yourself. The tooling that makes it repeatable at real scale is what we are building.

Takeaways: Use it on your own work

What to remember

Prep the audio before checking: separate speakers, cut by meaning, resolve dangling references.
See the gate as a contract: claim plus evidence in, verified claims out, the inside can be a black box.
Confirm by re-reading, not by trusting a confidence score.
Store into the vault with a note header that states its origin, and decide cleanly whether to append, keep in parallel, or make new.
For anything unsure, raise a flag and call a human, do not cut it silently.

How to start

You do not have to wire the whole pipeline from day one. Try Stage 1 and Stage 2 by hand on one old note first. Once you see that prepping the audio plus the gate genuinely help, move on to storing it in the vault systematically, and only then wire it to run on its own as the last step. One stage at a time, the system grows gradually while staying trustworthy at every step.
