productize.life
local LLM · fit to the job

The best model is often the wrong one

When we went to run a local model for real, we almost picked the one topping the benchmark, then realized we had been asking the wrong question all along.

Yim · written with Dobby (AI Oracle) / Jun 24, 2026 / ~5 min

When we set out to run a local LLM ourselves, the first question that popped up was "which one is best?" Open the benchmark table, see whose coding score is highest, and shortlist that one.

But the moment we tested for real, the model with the pretty benchmark was not the one that worked. The job was to chat in Thai with people on the team, not to solve LeetCode. A sky-high coding score told us almost nothing about whether it could hold a Thai conversation.

That is when it clicked: we had been asking the wrong question from the start. The question is not "which one is best", it is "what is our job, and which one fits it?"

Part 1: Choose by the job, not the benchmark

A benchmark tells you what a model is good at, not whether it is right for our job. Those are different things. A model with a high coding score may be weak in Thai; a strong generalist may be unreliable at tool-calling.

Before looking at which one is strongest, answer these four first.

  1. What is the job? Writing code, holding a conversation with a consistent character, or long reasoning.
  2. Which language? If the job is in Thai, this matters more than the coding score, because most benchmarks measure in English and say nothing about how well a model handles Thai.
  3. Can the data leave for the cloud? If the data is sensitive and cannot leave, local is the only road. Privacy is a selection criterion, not a bonus.
  4. Which failure can you live with? Some models are accurate when you feed them facts but make things up when left to guess. If your job already supplies the facts, that weakness is not a problem.

Order the questions this way and "the strongest one" usually drops out by question 2 or 3.
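As a rough illustration of that ordering, here is a minimal sketch that filters on questions 2 and 3 first and only then lets the benchmark rank whatever survives. Everything in it is hypothetical: the `Candidate` and `Job` types, the model names, and the scores are made up for the example, not real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical model name
    strong_languages: frozenset
    runs_locally: bool
    benchmark_score: float    # hypothetical headline number


@dataclass(frozen=True)
class Job:
    language: str
    data_may_leave: bool


def shortlist(candidates, job):
    """Filter by the job first, then use the benchmark only to order the survivors."""
    fits = [
        c for c in candidates
        if job.language in c.strong_languages          # question 2: language
        and (job.data_may_leave or c.runs_locally)     # question 3: privacy
    ]
    return sorted(fits, key=lambda c: c.benchmark_score, reverse=True)


candidates = [
    Candidate("flagship-x", frozenset({"en"}), False, 90.0),
    Candidate("mid-a", frozenset({"en", "th"}), True, 77.0),
    Candidate("mid-b", frozenset({"en"}), True, 80.0),
]
job = Job(language="th", data_may_leave=False)
result = [c.name for c in shortlist(candidates, job)]
print(result)
```

With a Thai, data-stays-local job, the candidate with the highest score is filtered out before its score is ever looked at.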

Part 2: What we actually compared: Qwen vs Gemma

When it came to the real choice, we had two on the table: Qwen3.6 and Gemma 4. Both mid-sized, both run on your own machine, both open under the same license (Apache 2.0).

| Criterion | Qwen3.6-27B | Gemma 4 12B |
| --- | --- | --- |
| size | 27B dense | 12B dense |
| context | 262K (extends to ~1M) | 256K |
| license | Apache 2.0 | Apache 2.0 |
| coding | SWE-bench Verified 77.2 (very strong) | capable, but not its strength |
| language | 201 languages, strong on Asian | 35+ out of the box, 140+ pretrained, but weaker on Asian |
| other strengths | multimodal, has an MoE variant (35B-A3B) | multimodal, function-calling, QAT (quality per file size) |

Our case: the job is an AI teammate the team talks to in Thai, so there were three criteria: natural Thai; a consistent character (a steady tone and way of speaking, what English calls a persona); and keeping the chat data, which is internal, off the cloud. Measured against those three, Qwen wins. Not because of coding (both are fine), but because Qwen is clearly better at Asian languages, even though the overall benchmark tables look like a close call.

That is the whole lesson in one line: the benchmarks are close, but tie them to the real job and the one that fits wins outright.

As for the flagships like GLM-5.2 or Kimi K2.7, they are world-class at coding, true, but they run into the hundreds of billions to a trillion parameters and will not run on an ordinary machine. Route them through the cloud and you lose the privacy that was the whole reason to go local in the first place. Strongest in the world, but wrong for a job that has to live on your own machine.

Part 3: What the spec-sheet terms mean in practice

You meet these terms everywhere in a spec sheet. Here is each in plain language, because each one ties to a different hardware limit.

QAT: trained to tolerate the squeeze

A model is just a huge pile of numbers (called weights) stored at high precision, which makes the pile large. "Squeezing" (quantization) means lowering the precision of those numbers, say from 16-bit down to 4-bit, so the model file shrinks enough to run on a small machine. The old way trains the model fully, then squeezes afterward, and quality drops a little because the model never got used to being squeezed. QAT (quantization-aware training) simulates the squeeze during training, so the model learns to tolerate it. When you squeeze for real, the file comes out roughly the same size as with the old way, but the quality stays closer to the full-precision model. (Google)
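To make "the squeeze" concrete, here is a minimal sketch of plain round-to-nearest quantization on a handful of made-up weights, plus the file-size arithmetic. It shows the squeeze itself, not QAT: as we understand it, QAT runs this same kind of rounding inside the training loop so the weights settle where rounding hurts least. The function names and the sample weights are ours, chosen for illustration.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization: returns integer codes and the scale."""
    levels = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]


def file_size_gb(n_params, bits):
    """Weights only: parameters times bits per parameter, converted to gigabytes."""
    return n_params * bits / 8 / 1e9


weights = [0.12, -0.53, 0.98, -0.05, 0.31]        # made-up sample weights
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, round(max_error, 3))
print(file_size_gb(27e9, 16), file_size_gb(27e9, 4))
```

By this arithmetic a 27B model is about 54 GB of weights at 16-bit and about 13.5 GB at 4-bit. The rounding error per weight is at most half of `scale`, which is the quality cost that QAT trains the model to absorb.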

MTP: predicting several words at once

When a model writes text, it predicts one word at a time, and each next word has to wait for the previous one, because the model guesses the next word from the words already out. Running word by word in a chain like that is the speed bottleneck. MTP (multi-token prediction) trains the model to predict several words ahead at once, then checks in one pass how many it got right. Because only the guesses that pass the check are kept, generation gets faster with no loss of accuracy. But there is a catch: the speedup comes from verifying many words in parallel. A GPU does that almost for free because it is built for massively parallel work. A CPU has far less parallelism to spare, so even with words guessed ahead it ends up checking them close to one by one, and the benefit nearly vanishes. In short, MTP pays off mainly when you have a GPU. (Gloeckle et al., 2024)
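The guess-then-check loop can be sketched with toy stand-ins. This is a simplified illustration of the idea, not how any particular runtime implements MTP: `target_next` plays the full model, `draft_next` plays the cheap look-ahead guesses (deliberately wrong at every third position), and both are hypothetical functions over a fixed sentence.

```python
SENTENCE = ["the", "model", "writes", "one", "token", "at", "a", "time"]


def target_next(prefix):
    """Toy stand-in for the full model: the token it would write next."""
    return SENTENCE[len(prefix)]


def draft_next(prefix):
    """Toy stand-in for the look-ahead guess: wrong at every third position."""
    n = len(prefix)
    return "?" if n % 3 == 2 else SENTENCE[n]


def generate(length, k):
    """Guess up to k tokens ahead, then verify them in one pass."""
    out, passes = [], 0
    while len(out) < length:
        draft = []
        for _ in range(min(k, length - len(out))):
            draft.append(draft_next(out + draft))
        passes += 1                      # one verification pass per batch of guesses
        accepted = []
        for tok in draft:                # on a GPU these checks run in parallel
            correct = target_next(out + accepted)
            accepted.append(correct)     # always keep the full model's token
            if tok != correct:
                break                    # discard every guess after the first miss
        out.extend(accepted)
    return out, passes


out, passes = generate(length=len(SENTENCE), k=4)
print(out == SENTENCE, passes)
```

The output is exactly what the full model would have written on its own, which is why accuracy does not change. The gain is that it took fewer verification passes than tokens. If each pass checks its guesses one after another, as in this plain Python loop, that gain disappears, which is the CPU situation described above.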

dense vs MoE: active params vs total params

First, a parameter is a weight, one of the numbers in the pile from a moment ago. More parameters generally mean a more capable model, but also one that eats more resources. There are two ways to manage them. A dense model uses every parameter it has each time it produces a word. An MoE (mixture of experts) model splits the parameters into several expert groups and calls only some groups per word, so a model totalling 671B uses just 37B per word. The common misunderstanding is to mix up the two counts: active params (used per word) set speed and compute, while total params (all of them) set the RAM you must load. An MoE has to load the whole pile of weights into the machine up front, so even though it only reaches for a few groups at run time, it still eats RAM equal to the full size. (Mixtral · HuggingFace)
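The two counts are easy to keep apart with a little arithmetic. This sketch uses the 671B-total, 37B-active figures from the paragraph above and the 27B dense size from the comparison table; the 4-bit precision is our assumption, and the memory figure covers weights only, not the context cache or runtime overhead.

```python
def weight_memory_gb(total_params, bits):
    """RAM needed just to hold the weights: every parameter must be loaded."""
    return total_params * bits / 8 / 1e9


def active_fraction(active_params, total_params):
    """Share of the parameters actually used to produce one token."""
    return active_params / total_params


moe_total, moe_active = 671e9, 37e9      # MoE example from the text
dense_total = 27e9                       # a dense model uses all of its parameters

moe_ram = weight_memory_gb(moe_total, bits=4)
dense_ram = weight_memory_gb(dense_total, bits=4)
moe_share = active_fraction(moe_active, moe_total)

print(f"MoE:   {moe_ram:.1f} GB to load, {moe_share:.1%} of weights used per token")
print(f"Dense: {dense_ram:.1f} GB to load, 100.0% of weights used per token")
```

Even squeezed to 4-bit, the MoE needs roughly 335 GB just for weights while touching only about 5.5% of them per token. Its per-token compute is in the same range as a mid-sized dense model, but its memory bill is not.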

Part 4: GPU and RAM are the last gate, not the first

Once you know which one the job needs, then run it through the machine's limits: is there enough RAM, is there a GPU, how hard do you have to squeeze it to fit.

The order matters. Start from the machine ("we have this much RAM, take whatever fits") and you cut the model that fits the job without noticing. Start from the job and the machine only tells you which version of your chosen model to run and how much to squeeze it. It does not choose the model for you.

The machine's ceiling sets the size you can run. It should not set what your job is.
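In code, "the machine is the last gate" might look like the sketch below: the model is already chosen, and the hardware only decides how hard to squeeze it. The function name, the precision options, and the 80% headroom figure (leaving room for the context cache and the rest of the system) are our assumptions, not a rule from any runtime.

```python
def pick_precision(n_params, ram_gb, options=(16, 8, 4), headroom=0.8):
    """Return the highest precision in bits whose weights fit the budget, else None."""
    budget_gb = ram_gb * headroom                 # assumed safety margin
    for bits in sorted(options, reverse=True):    # try the least-squeezed option first
        if n_params * bits / 8 / 1e9 <= budget_gb:
            return bits
    return None                                   # no option fits on this machine


chosen_model_params = 27e9                        # the model the job already picked

for ram_gb in (64, 32, 8):
    print(ram_gb, "GB RAM ->", pick_precision(chosen_model_params, ram_gb))
```

A `None` here means this machine cannot run the chosen model at any listed precision. That is a reason to look at a smaller variant of the same family or at different hardware, not to go back and redefine the job.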

The short version, ready to use: when choosing a local LLM, don't open the benchmark first. Open the questions: what is the job, which language, can the data leave for the cloud, which failure can you live with. Answer those four, then bring in the benchmark and the hardware limits to filter.

Take one job you want AI to do and answer those four. You may find the right fit is not the one topping the table.
