productize.life
AI · Speech-to-text

Transcribe on your own machine, faster than Whisper

A batch of Thai lecture audio took almost seven hours to transcribe locally. The slow part was not the hardware. It was the model architecture. Here is what changed when I swapped it.

Yim · written with Dobby (AI Oracle) / 14 Jun 2026

This follows the earlier post on turning speech into a trustworthy knowledge base without letting AI make things up. That one stayed at the level of principle. This one goes under the hood, to the part that looked easiest and turned out to be the slowest: transcription. The fix was not a faster machine. It was a model built on a different architecture.

It started simply. I had a long set of Thai lectures I wanted as text. Most clips went straight into NotebookLM and that was that. But a few clips it would not accept, so those had to be transcribed locally. The first free tool that comes to mind is Whisper, the open-source speech model from OpenAI that people use worldwide, along with the Thai forks that fine-tune it for Thai: Pathumma (from Thailand's NECTEC), and Typhoon Whisper (from SCB10X, part of the SCBX group). That is where I hit the wall.

A single ten-minute clip took nearly an hour to transcribe. The whole job was dozens of clips, about seven hours in total. At the time I assumed the machine was not powerful enough. Once I actually dug in, I realised I had been reading the problem wrong the whole time.

Part 1: The seven-hour wall, and fixing the wrong thing

The first thing everyone does when transcription is slow is fiddle with the things around the model. Cut the clips shorter. Feed a bigger batch per pass. Move it onto the GPU. Try a smaller model. All of this helps a little, and none of it fixes the root.

The numbers I measured make it clear. Pathumma, a Whisper large-v3 model fine-tuned for Thai, running on a Mac GPU, took 317 seconds to transcribe just 2 minutes of audio. That is 0.38x realtime, slower than the speech itself plays. The ten-minute clip from the start, the one that dragged on for nearly an hour, was slower still because I pushed the whole clip through in a single pass, where the slowdown gets steeper.
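The realtime figure is simply audio length divided by wall-clock time. A minimal sketch of that arithmetic, using the Pathumma measurement from this post; the function name `realtime_speed` is my own and not part of any library.

```python
def realtime_speed(audio_seconds: float, wall_seconds: float) -> float:
    """Return seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# The measurement from this post: 2 minutes of audio in 317 seconds.
speed = realtime_speed(audio_seconds=120, wall_seconds=317)
print(f"{speed:.2f}x realtime")  # 0.38x realtime
```

Anything below 1.0x means the model falls behind the recording. Anything above it means the model finishes before the audio would have played.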

There is a mental trap here. Once you see that short clips are fast and long clips are slow in a way that blows up super-linearly, you conclude the fix is to chop the long clip into small pieces and transcribe them one at a time. That does help, but it is managing symptoms, not curing the disease. The slow part is how the model thinks, not the length of the file.
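For reference, the chopping itself is simple to sketch. The version below only computes the cut points. The 30-second chunk length and 1-second overlap are illustrative assumptions, not the values used in my pipeline.

```python
def chunk_spans(total_seconds: float, chunk_seconds: float = 30.0,
                overlap_seconds: float = 1.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into fixed-length spans that overlap slightly."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    spans = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last span reached the end of the clip
        start += step
    return spans


# A ten-minute clip, cut into overlapping 30-second pieces.
for start, end in chunk_spans(600.0)[:3]:
    print(f"{start:.0f}s to {end:.0f}s")
```

The small overlap is there so a word that straddles a cut appears whole in at least one piece. Each piece still goes through the same token-by-token decoder, which is why this manages the symptom rather than the cause.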

The real thing to fix is not the clip size, not the GPU, and not Whisper's settings. It is deeper than that. It is how the model turns sound into text in the first place.

Part 2: The real variable is architecture, not tuning

Why Whisper is slow by nature

Whisper is what is called an autoregressive model. In plain terms, it produces text one token at a time, in sequence. Each token has to wait for the previous one, because the model feeds the token it just produced back in as input for the next. Most of its work, then, is predicting the next token, hundreds of times over for a single minute of audio.

That shape means the decoding loop cannot be parallelised across tokens. However strong your GPU is, it still has to walk one step at a time, in order. On a CPU-only machine it gets even slower. This is why every earlier attempt, whether faster-whisper, Pathumma, or Typhoon Whisper Turbo, was slow. They are all Whisper, all decoding token by token. Chopping the clips was only ever a band-aid.
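To make the sequential dependency concrete, here is a toy sketch of a greedy autoregressive loop. It is not Whisper's code: `toy_next_token` is a hypothetical stand-in for a full decoder forward pass, and `EOS` is a made-up end marker.

```python
from typing import Callable, List

EOS = -1  # hypothetical end-of-sequence marker for this toy example


def greedy_autoregressive_decode(next_token: Callable[[List[int]], int],
                                 max_steps: int = 100) -> List[int]:
    """Generate tokens one at a time; each step consumes all previous output."""
    tokens: List[int] = []
    for _ in range(max_steps):
        token = next_token(tokens)  # cannot start until `tokens` is final
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def toy_next_token(prefix: List[int]) -> int:
    """Stand-in for a decoder forward pass: counts up to 4, then stops."""
    return len(prefix) if len(prefix) < 5 else EOS


print(greedy_autoregressive_decode(toy_next_token))  # [0, 1, 2, 3, 4]
```

The loop body is the expensive part in a real model, and step N cannot begin before step N-1 has returned. Extra hardware can make each step faster, but it cannot run the steps side by side.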

A different way to transcribe

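Another family of speech models works in a very different way: CTC and transducer models. Instead of running a large decoder once per token of text, they step along the audio itself. The encoder processes the audio frames in parallel, and what is left of decoding is light: a CTC model labels every frame independently, and a transducer adds only a small prediction network on top. (Strictly, only CTC is fully non-autoregressive. A transducer still conditions on the tokens it has already emitted, but that step is cheap.) Either way, they use the hardware far more fully. This is one reason the big cloud transcription services feel so fast: many of them run this family of model under the hood, not Whisper.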
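For the CTC side of this family, greedy decoding can be sketched in a few lines: pick the best label for each frame, collapse repeats, and drop the blank symbol. This is a plain-Python illustration with made-up scores, not the decoder of any particular model.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank symbol in this sketch


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    # Each frame's best label depends only on that frame's scores,
    # so this step has no dependency between frames.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    decoded: List[int] = []
    previous = None
    for label in best:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return decoded


# Six frames over a 3-symbol vocabulary (0 = blank); the scores are made up.
scores = [
    [0.1, 0.8, 0.1],    # label 1
    [0.2, 0.7, 0.1],    # label 1 again (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # label 1 (kept, a blank separated it)
    [0.1, 0.2, 0.7],    # label 2
    [0.8, 0.1, 0.1],    # blank
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

The per-frame argmax is the part that maps onto parallel hardware. The collapse pass that follows is a cheap linear scan, not a model call.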

This is not a cloud-only trick. There is a free Thai transducer too. The one I tried is Typhoon ASR Real-time by SCB10X, a FastConformer-Transducer model (NeMo RNN-Transducer, confirmed from the run log). Once I ran it, the numbers were from a different world.

Measured on the same machine, on the same Thai lecture audio. Raw numbers from real run logs.
| Model | Architecture | Runs on | Speed (x realtime) |
| --- | --- | --- | --- |
| Pathumma (Whisper large-v3) | autoregressive (token by token) | Mac GPU | 0.38x |
| Typhoon ASR Real-time | transducer (parallel) | CPU only | 80x |

Read that table slowly, because it cuts against intuition. The slow one (0.38x) is running on a GPU. The fast one (80x) is running on a plain CPU. They are about 200 times apart, and the faster one is on the weaker hardware. The hardware did nothing for Whisper. The architecture did.

Five minutes of lecture takes the transducer 3.8 seconds. At Whisper's measured 0.38x, the same five minutes works out to roughly 13 minutes, on the same machine. This is not tuning something to run better. It is changing how the model works.
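The two headline figures can be checked against each other with a few lines of arithmetic. This only uses numbers already quoted in this post; nothing here is a new measurement.

```python
audio_seconds = 5 * 60          # five minutes of lecture
transducer_wall_seconds = 3.8   # transducer time quoted in this post
whisper_speed = 0.38            # Pathumma speed quoted in this post, x realtime

transducer_speed = audio_seconds / transducer_wall_seconds
ratio = transducer_speed / whisper_speed

print(f"transducer: {transducer_speed:.0f}x realtime")  # transducer: 79x realtime
print(f"gap: about {ratio:.0f}x")                        # gap: about 208x
```

So 3.8 seconds for five minutes is consistent with the roughly 80x figure, and the gap to 0.38x is a little over 200 times.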

The lesson sticks to almost any kind of building with AI. Before you polish the thing in front of you, step back and ask whether you are tuning the right thing at all. Tuning Whisper a little faster, however well you do it, will never beat switching to a model that was built to be fast from the start.

Part 3: The structure around it, and how you start

Speed is only half the story, because the text that comes out fast still cannot be trusted whole. Any transcription model, fast or slow, misspells words and mangles English spoken inside Thai. Worse, once you hand the transcript to AI to summarise, it will add things that sound plausible but that nobody actually said.

This is where speed and trust have to come together. So the whole thing is wrapped into a skill that runs as one pipeline: cut the audio, transcribe with the transducer model, then run a gate that forces every claim to point back to evidence in the real audio before it counts as knowledge.

The gate matters more than the transcriber

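The skill splits into clear layers that follow the pipeline: cutting the audio, transcribing it with the transducer model, and the gate.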

The point worth stressing is that the gate matters more than the transcriber. Bare transcript text has nothing backing it, unlike a NotebookLM answer that arrives with sources. Text with nothing behind it is exactly what AI most easily builds a fabrication onto. The error to fear is not random noise. It is plausible guessing.
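A rough sketch of what such a gate can look like. The data shapes (`Claim`, a transcript keyed by segment id) and the matching rule (the quoted words must appear in the cited segment) are assumptions chosen for illustration, not the actual skill.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    text: str        # the statement a summary wants to keep
    segment_id: str  # transcript segment it cites as evidence
    quote: str       # words it says were spoken in that segment


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def grounding_gate(claims: List[Claim],
                   transcript: Dict[str, str]) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (kept, cut); keep only quotes found in the cited segment."""
    kept, cut = [], []
    for claim in claims:
        segment_text = transcript.get(claim.segment_id, "")
        quote = _normalise(claim.quote)
        if quote and quote in _normalise(segment_text):
            kept.append(claim)
        else:
            cut.append(claim)  # no source found means cut it
    return kept, cut


transcript = {"clip01_0030": "A transducer decodes along the audio frames."}
claims = [
    Claim("Transducers decode along frames.", "clip01_0030",
          "decodes along the audio frames"),
    Claim("The model was trained on 10,000 hours.", "clip01_0030",
          "trained on 10,000 hours"),
]
kept, cut = grounding_gate(claims, transcript)
print(len(kept), len(cut))  # 1 1
```

This version only checks claims against transcript text, so it inherits any transcription errors. The segment ids are what let a person jump back to the matching stretch of audio when a claim matters.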

What is held back

Every principle in this post is open: choose a transducer over Whisper, build a grounding gate, label trust. The part that took real work is the actual values that make it smooth at scale, the way audio is cut, the model settings, and the glue that ties every step together so it runs on its own. Think of it as a fixed interface with a swappable engine inside, where the values are tuned to your own material. There is no single magic number that works for every job.
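The "fixed interface, swappable engine" idea can be sketched with a Python `Protocol`. The names `Transcriber`, `EchoTranscriber`, and `run_pipeline` are hypothetical, and the placeholder engine does no real transcription.

```python
from typing import Protocol


class Transcriber(Protocol):
    """The fixed interface: anything with this method can be the engine."""

    def transcribe(self, audio_path: str) -> str: ...


class EchoTranscriber:
    """Placeholder engine for the sketch; a real one would wrap a speech model."""

    def transcribe(self, audio_path: str) -> str:
        return f"[transcript of {audio_path}]"


def run_pipeline(engine: Transcriber, audio_paths: list[str]) -> dict[str, str]:
    """Transcribe each file with whichever engine was passed in."""
    return {path: engine.transcribe(path) for path in audio_paths}


print(run_pipeline(EchoTranscriber(), ["lecture01.wav"]))
# {'lecture01.wav': '[transcript of lecture01.wav]'}
```

The pipeline only depends on the `transcribe` method, so trying a different model means writing one small wrapper class and leaving everything downstream alone.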

If you want to try it, start here

  1. Do not tune Whisper first. If your job is transcribing piles of audio and speed is the bottleneck, try a transducer model first. For Thai there is SCB10X's Typhoon ASR Real-time; for other languages, the NVIDIA NeMo toolkit offers Conformer-Transducer models in many languages.
  2. Measure with the realtime factor. Time the transcription against the length of the audio. Do not guess which model is faster. Measure it on your own machine and your own audio.
  3. Do not trust the transcript right away. However fast the model is, before you summarise the result, put up a gate that makes every claim point back to the real audio. No source found means cut it.
  4. Keep the source audio and transcript. They are the evidence your notes point back to. Do not delete them once the summary is done.
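For step 2, a small timing harness is enough. The sketch below wraps any transcription function and reports its speed. `fake_transcribe` is a stand-in so the example runs without a model, and you would pass your own model call in its place.

```python
import time
from typing import Callable, Tuple


def measure_speed(transcribe: Callable[[str], str], audio_path: str,
                  audio_seconds: float) -> Tuple[str, float]:
    """Run one transcription and return (text, speed in x realtime)."""
    started = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = max(time.perf_counter() - started, 1e-9)  # avoid dividing by zero
    return text, audio_seconds / elapsed


def fake_transcribe(audio_path: str) -> str:
    """Stand-in for a real model call so the sketch runs anywhere."""
    time.sleep(0.05)
    return "placeholder transcript"


text, speed = measure_speed(fake_transcribe, "clip.wav", audio_seconds=10.0)
print(f"{speed:.1f}x realtime")
```

Run it on the same clip for each candidate model, and on more than one clip length, since a model that looks fine on two minutes can behave very differently on ten.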

Try it on a single clip and you will see it for yourself: just changing the model architecture takes a clip that ran for the better part of an hour down to a few seconds, and turns a chore you keep avoiding into something you can do every day.

What to do once you have the text, whether from NotebookLM or your own transcript, is for the next posts in the series: how to drive it into notes you can actually use, the prompt principles, and the skills you can set up to run on their own.
