A Faster Whisper Alternative for Local Speech-to-Text

This follows the earlier post on turning speech into a trustworthy knowledge base without letting AI make things up. That one stayed at the level of principle. This one goes under the hood, to the part that looked easiest and turned out to be the slowest: transcription. The fix was not a faster machine. It was a model built on a different architecture.

It started simply. I had a long set of Thai lectures I wanted as text. Most clips went straight into NotebookLM and that was that. But it would not accept a few clips, so those had to be transcribed locally. The first free tool that comes to mind is Whisper, the open-source speech model from OpenAI that people use worldwide, along with the Thai forks that fine-tune it for Thai: Pathumma (from Thailand's NECTEC) and Typhoon Whisper (from SCB10X, part of the SCBX group). That is where I hit the wall.

A single ten-minute clip took nearly an hour to transcribe. The whole job was dozens of clips, about seven hours in total. At the time I assumed the machine was not powerful enough. Once I actually dug in, I realised I had been reading the problem wrong the whole time.

Part 1: The seven-hour wall, and fixing the wrong thing

The first thing everyone does when transcription is slow is fiddle with the things around the model. Cut the clips shorter. Feed a bigger batch per pass. Move it onto the GPU. Try a smaller model. All of this helps a little, and none of it fixes the root.

The numbers I measured make it clear. Pathumma, a Whisper large-v3 model for Thai, running on a Mac GPU: just 2 minutes of audio took 317 seconds to transcribe, which is 0.38x realtime. That is slower than the speech itself plays. The ten-minute clip from the start, the one that dragged on for nearly an hour, was slow because the whole long clip was pushed through in a single pass, where it slows down even more steeply.
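The 0.38x figure is just audio length divided by wall-clock time. A minimal sketch of that arithmetic, using the two numbers quoted above (the function name `speed_factor` is an illustrative choice, not something from the original run scripts):

```python
def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Numbers from the post: 2 minutes of audio took 317 seconds.
pathumma = speed_factor(audio_seconds=2 * 60, wall_seconds=317)
print(f"{pathumma:.2f}x realtime")  # 120 / 317, rounds to 0.38
```

Anything below 1.0x means the transcription runs slower than the recording plays.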

There is a mental trap here. Once you see that short clips are fast and long clips are slow in a way that blows up super-linearly, you conclude the fix is to chop the long clip into small pieces and transcribe them one at a time. That does help, but it is managing symptoms, not curing the disease. The slow part is how the model thinks, not the length of the file.

The real thing to fix is not the clip size, not the GPU, and not Whisper's settings. It is deeper than that. It is how the model turns sound into text in the first place.

Part 2: The real variable is architecture, not tuning

Why Whisper is slow by nature

Whisper is what is called an autoregressive model. In plain terms, it produces text one token at a time, in sequence. The next token has to wait for the previous one, because it feeds the token it just produced back in as input for the next. Most of its work, then, is guessing the next token, hundreds of times over for a single minute of audio.

That shape means the decoding step cannot be parallelised across tokens. However strong your GPU is, it still has to walk one step at a time, in order. On a CPU-only machine it gets even slower. This is why every earlier attempt, whether faster-whisper, Pathumma, or Typhoon Whisper Turbo, was slow. They are all Whisper, all decoding token by token. Chopping the clips was only ever a band-aid.
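The sequential dependency is easier to see in code. This is a toy sketch of greedy autoregressive decoding, not Whisper's actual implementation: `toy_step` is a made-up stand-in for one decoder forward pass, and the token ids are arbitrary.

```python
from typing import Callable, List


def greedy_autoregressive(
    step: Callable[[List[int]], int], eos: int, max_tokens: int
) -> List[int]:
    """Generate tokens one at a time; each call needs all earlier tokens."""
    tokens: List[int] = []
    for _ in range(max_tokens):
        nxt = step(tokens)  # cannot start until the previous token exists
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens


call_log: List[int] = []  # prefix length seen at each call


def toy_step(prefix: List[int]) -> int:
    """Hypothetical stand-in for a decoder pass: replays a fixed script."""
    call_log.append(len(prefix))
    script = [7, 3, 9]
    return script[len(prefix)] if len(prefix) < len(script) else 0


tokens = greedy_autoregressive(toy_step, eos=0, max_tokens=10)
print(tokens)    # [7, 3, 9]
print(call_log)  # [0, 1, 2, 3]: one call per token, strictly in order
```

Each call receives a longer prefix than the one before it, so the calls cannot be issued side by side no matter how many cores are free.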

A different way to transcribe

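Another family of speech models works in a different way: CTC and transducer models. CTC models are non-autoregressive: they score every short slice of audio in parallel, all at once, and no slice waits for the others. Transducers are not strictly non-autoregressive, since a small prediction network still looks at the labels already emitted, but they step through audio frames with a lightweight decoder instead of running a large attention decoder once per token. In both, most of the work happens in an encoder that processes the audio frames in parallel, so they use the hardware far more fully. This is one reason the big cloud transcription services feel so fast: many of them run this family of model under the hood, not Whisper.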
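For contrast, here is a minimal sketch of CTC greedy decoding, the simplest member of this family. It is not the transducer decoding that Typhoon ASR Real-time uses, and the scores below are invented for illustration. The point is that the best label for each frame depends only on that frame.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # CTC reserves one label for "no output on this frame"


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick the best label per frame, collapse repeats, drop blanks."""
    # Each frame is scored independently, so this step is parallelisable.
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    collapsed = [label for label, _ in groupby(best)]
    return [label for label in collapsed if label != BLANK]


# Invented per-frame scores over labels [blank, 1, 2].
scores = [
    [0.10, 0.80, 0.10],  # label 1
    [0.20, 0.70, 0.10],  # label 1 again (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank separates two real 1s
    [0.10, 0.60, 0.30],  # label 1
    [0.10, 0.20, 0.70],  # label 2
    [0.10, 0.20, 0.70],  # label 2 again (repeat, collapsed)
]
decoded = ctc_greedy_decode(scores)
print(decoded)  # [1, 1, 2]
```

There is no loop that waits on a previously generated token here, which is the property the autoregressive loop lacks.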

This is not a cloud-only trick. There is a free Thai transducer too. The one I tried is Typhoon ASR Real-time by SCB10X, a FastConformer-Transducer model (NeMo RNN-Transducer, confirmed from the run log). Once I ran it, the numbers were from a different world.

Measured on the same machine, on the same Thai lecture audio. Raw numbers from real run logs.
Model	Architecture	Runs on	Speed (x realtime)
Pathumma (Whisper large-v3)	autoregressive (token by token)	Mac GPU	0.38x
Typhoon ASR Real-time	transducer (parallel)	CPU only	80x

Read that table slowly, because it cuts against intuition. The slow one (0.38x) is running on a GPU. The fast one (80x) is running on a plain CPU. They are about 200 times apart, and the faster one is on the weaker hardware. The hardware did nothing for Whisper. The architecture did.

Five minutes of lecture would take Whisper roughly 13 minutes at 0.38x. The transducer finishes it in 3.8 seconds, on the same machine. This is not tuning something to run better. It is changing how the model works.
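To see what the two speed figures mean for one clip, divide the audio length by the speed. This sketch only projects from the rounded figures in the table, so the results are estimates, not new measurements (`projected_seconds` is an illustrative name):

```python
def projected_seconds(audio_seconds: float, speed: float) -> float:
    """Estimated wall-clock time for a clip at a given x-realtime speed."""
    return audio_seconds / speed


five_minutes = 5 * 60
whisper_s = projected_seconds(five_minutes, 0.38)
transducer_s = projected_seconds(five_minutes, 80)
ratio = 80 / 0.38

print(f"Whisper:    {whisper_s / 60:.1f} min")  # about 13 minutes
print(f"Transducer: {transducer_s:.2f} s")      # 3.75 s from the rounded 80x
print(f"Gap:        {ratio:.0f}x")              # roughly 200x
```

The projected 3.75 seconds sits next to the measured 3.8 seconds, which is what a speed of roughly 80x should give.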

The lesson applies to almost anything you build with AI. Before you polish the thing in front of you, step back and ask whether you are tuning the right thing at all. Tuning Whisper a little faster, however well you do it, will never beat switching to a model that was built to be fast from the start.

Part 3: The structure around it, and how you start

Speed is only half the story, because the text that comes out fast still cannot be trusted whole. Any transcription model, fast or slow, misspells words and mangles English spoken inside Thai. Worse, once you hand the transcript to AI to summarise, it will add things that sound plausible but that nobody actually said.

This is where speed and trust have to come together. So the whole thing is wrapped into a skill that runs as one pipeline: cut the audio, transcribe with the transducer model, then run a gate that forces every claim to point back to evidence in the real audio before it counts as knowledge.

The gate matters more than the transcriber

The skill splits into clear layers:

Intake. Several sources can plug in. Clips NotebookLM accepts get transcribed there, with answers that come with citations. Clips it refuses fall through to the local transcriber. This is where the transducer steps in to replace Whisper as the fallback.
Synthesis. Summarise only from the supplied file, with no outside knowledge mixed in. Keep the core, do not copy word for word.
Grounding gate. The part that makes the whole thing trustworthy. A separate checker has one job: find where each claim, number, and name appears in the real audio. Anything with no anchor gets cut. Nothing is kept "just in case".
Trust labelling. Notes from a self-made transcript are marked as not yet passed, and rank below notes that carry their own citations, until they clear the gate. The label tells you at a glance how far each note can be trusted.
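The grounding gate can be sketched in a few lines. This is an illustration under stated assumptions, not the checker used in the skill: the `Segment` and `Claim` shapes are hypothetical, and plain substring matching stands in for whatever the real checker does to locate a claim in the audio.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    start: float  # seconds into the source audio
    end: float
    text: str     # transcript text for this span


@dataclass
class Claim:
    text: str   # the statement in the summary
    quote: str  # the words the claim says it is based on


def _normalise(s: str) -> str:
    return " ".join(s.lower().split())


def find_anchor(claim: Claim, segments: List[Segment]) -> Optional[Segment]:
    """Return the first segment containing the claim's quote, if any."""
    needle = _normalise(claim.quote)
    if not needle:
        return None  # no quote offered means no anchor
    for seg in segments:
        if needle in _normalise(seg.text):
            return seg
    return None


def grounding_gate(
    claims: List[Claim], segments: List[Segment]
) -> List[Tuple[Claim, Segment]]:
    """Keep only claims that point back to a real segment; cut the rest."""
    kept = []
    for claim in claims:
        anchor = find_anchor(claim, segments)
        if anchor is not None:
            kept.append((claim, anchor))
    return kept


segments = [
    Segment(0.0, 12.0, "Today we look at how the model turns sound into text."),
    Segment(12.0, 25.0, "Two minutes of audio took 317 seconds to transcribe."),
]
claims = [
    Claim("Two minutes of audio needed 317 seconds.", "took 317 seconds"),
    Claim("The speaker recommends buying a new GPU.", "buy a new GPU"),
]
kept = grounding_gate(claims, segments)
for claim, anchor in kept:
    print(f"[{anchor.start:.0f}s-{anchor.end:.0f}s] {claim.text}")
```

The second claim sounds plausible but has no anchor in the transcript, so it is cut instead of being kept "just in case".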

The point worth stressing is that the gate matters more than the transcriber. Bare transcript text has nothing backing it, unlike a NotebookLM answer that arrives with sources. Text with nothing behind it is exactly what AI most easily builds a fabrication onto. The error to fear is not random noise. It is plausible guessing.

What is held back

Every principle in this post is open: choose a transducer over Whisper, build a grounding gate, label trust. The part that took real work is the actual values that make it run smoothly at scale: the way audio is cut, the model settings, and the glue that ties every step together so it runs on its own. Think of it as a fixed interface with a swappable engine inside, where the values are tuned to your own material. There is no single magic number that works for every job.

If you want to try it, start here

Do not tune Whisper first. If your job is transcribing piles of audio and speed is the bottleneck, try a transducer model first. For Thai there is SCB10X's Typhoon ASR Real-time; for other languages, the NVIDIA NeMo toolkit bundles Conformer-Transducer models for many of them.
Measure with the realtime factor. Time the transcription against the length of the audio. Do not guess which model is faster. Measure it on your own machine and your own audio.
Do not trust the transcript right away. However fast the model is, before you summarise the result, put up a gate that makes every claim point back to the real audio. No source found means cut it.
Keep the source audio and transcript. They are the evidence your notes point back to. Do not delete them once the summary is done.
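For the measuring step, a small harness is enough. This sketch assumes WAV input, reads the duration with Python's standard `wave` module, and times any callable that takes a file path. `fake_transcribe` is a placeholder that only sleeps; swap in whichever model call you are testing.

```python
import os
import tempfile
import time
import wave
from typing import Callable


def wav_duration_seconds(path: str) -> float:
    """Length of a WAV file in seconds (frames divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def measure_speed(transcribe: Callable[[str], str], path: str) -> float:
    """Return the x-realtime speed of one transcription call."""
    audio_seconds = wav_duration_seconds(path)
    start = time.perf_counter()
    transcribe(path)
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed


def fake_transcribe(path: str) -> str:
    """Placeholder for a real model call; it just waits briefly."""
    time.sleep(0.05)
    return ""


# Demo on two seconds of generated silence (16 kHz, mono, 16-bit).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000 * 2)
    demo_duration = wav_duration_seconds(demo_path)
    demo_speed = measure_speed(fake_transcribe, demo_path)

print(f"{demo_duration:.1f} s of audio at {demo_speed:.1f}x realtime")
```

Run it on the same clip for each model you are comparing, on the machine you will actually use, and compare the resulting factors.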

Try it on a single clip and you will see it for yourself: just changing the model architecture takes a clip that ran for the better part of an hour down to a few seconds, and turns a chore you keep avoiding into something you can do every day.

As for what to do once you have the text, whether it came from NotebookLM or your own transcript: how to drive it into notes you can actually use, the prompt principles behind that, and the skills you can set up to run on their own are all for the next posts in the series.

Sources and references

All speed numbers in this post were measured on the same machine, from real run logs (Jun 2026), not lifted from elsewhere.
Whisper architecture (encoder-decoder, decodes token by token): Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision
RNN-Transducer: Graves, Sequence Transduction with Recurrent Neural Networks
Conformer: Gulati et al., Conformer: Convolution-augmented Transformer for Speech Recognition
Cloud ASR built on non-autoregressive models: Zhang et al., Google USM
Model used in testing: scb10x/typhoon-asr-realtime on Hugging Face
Thai Whisper model used for comparison: Pathumma-whisper by NECTEC
