productize.life

Renting a GPU for your own LLM:
the expensive part is the idle hours

Three ways to pay for the same model: per-token APIs, a rented GPU by the hour, or serverless GPU. Live prices checked today, the break-even formula we actually use, and the data-governance line the price tag never shows.

Yim · written with Dobby (AI Oracle) · July 3, 2026

Yesterday we rented our first GPU: a single RTX 5090 on Vast.ai, a datacenter-grade machine in the EU at $0.756 an hour. (The market has offers at half that; we will get to why we paid more.) It runs a 27B open-weight model, meaning a model whose weights you can download and run yourself, through vLLM for large batches of short jobs. The booking took less than five minutes. Machine up, model serving, answers coming back.

What took far longer was the question before it: should we be renting a machine at all, when the same model is available per token through an API, with no machine to babysit and nothing billed while idle? This post is the homework we did to decide. Every price was checked live on the providers' pages on July 3, 2026, and every formula is one we actually use on our own workload.

Part 1: Three ways to pay for the same model

The same open-weight model, say a 27B Qwen, can be paid for in three ways, and the three bill on entirely different logic.

First, per token. An inference provider hosts the model on their hardware; you call an API and pay for what you use. The decisive advantage: your idle hours cost exactly zero. A night with no work is a night with no bill. The trade: every prompt travels through someone else's machine, and the model menu is theirs, not yours.

Second, rent a GPU and run it yourself. You get the whole machine. Any model, any settings, full control. In exchange, you accept the harshest fact of this path: the meter runs every hour, work or no work. A machine left on for a month at $0.36/hr costs roughly $260 before it does a single useful thing.

Third, serverless GPU. The middle path. The provider wakes a machine when a request arrives and bills only the seconds it runs. The rate is higher than a dedicated rental (real numbers in the next section), and there is a cold start: the wait while the machine wakes and the model loads. But the idle-hours problem is gone entirely.

Notice that "cheap" and "expensive" have not appeared yet. These three cannot be compared head-on until you know the rhythm of your own workload. Steady all-day traffic and bursty once-a-day batches give opposite answers.

Part 2: GPU rental prices today, three providers compared

The table below is for an RTX 5090 (32GB of VRAM, enough for a quantized 27B on a single card), checked live on July 3, 2026 against each provider's official pricing page or live marketplace.

| Provider | RTX 5090 per hour | Whose machines | Best for |
| --- | --- | --- | --- |
| Salad | $0.25 | A distributed network of 60,000+ GPUs, mostly consumer machines owned by individuals | Public-data workloads that can tolerate a leak; unbeatable on price |
| Vast.ai | from $0.36 (verified machines) | An auction marketplace: small hosts and datacenters side by side, machine grades you pick, prices move with supply | General work at a good price, if you will spend time picking a machine. Where we are (we picked a $0.756 datacenter-grade machine, not the cheapest) |
| RunPod | $0.99 (community pod) | Datacenters; stable prices, real support | Work where the machine vanishing mid-job is not acceptable |
| RunPod serverless | $1.58 (only while running) | Same datacenters, but machines wake per request | Bursty workloads that should not pay for idle hours |

Do not rush to crown Salad from this table. It is about a quarter of RunPod's price, yes, but the third column matters more than the second, and we will come back to it in Part 4.

Vast's number deserves a footnote: it is an auction market, not a fixed price list. $0.36 was the cheapest verified machine (one that passed the platform's checks) at the moment we looked. Accept unverified machines and you will find cheaper; look tomorrow and the number may move. That is the nature of the market. The good price comes with homework: picking the machine, checking its bandwidth, and accepting that a small host can disappear on you.
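The two RunPod rows in the table can be put on the same footing with one ratio. The sketch below is ours, not a provider formula: it assumes the serverless rate is charged only for the time the GPU is actually running, and it ignores any billing during cold starts. The function names are hypothetical.

```python
def breakeven_busy_fraction(dedicated_per_hour: float, serverless_per_hour: float) -> float:
    """Share of each hour the GPU must be busy before a dedicated rental
    costs less than serverless (assumes serverless bills busy time only)."""
    if serverless_per_hour <= 0:
        raise ValueError("serverless rate must be positive")
    return dedicated_per_hour / serverless_per_hour


def cost_per_wall_clock_hour(busy_fraction: float,
                             dedicated_per_hour: float,
                             serverless_per_hour: float) -> dict:
    """Cost of one wall-clock hour on each path at a given busy fraction."""
    return {
        "dedicated": dedicated_per_hour,                    # billed whether busy or not
        "serverless": serverless_per_hour * busy_fraction,  # billed only while running
    }


# Prices from the table above: RunPod community pod vs RunPod serverless.
line = breakeven_busy_fraction(0.99, 1.58)
print(f"dedicated wins above {line:.0%} busy time")
print(cost_per_wall_clock_hour(0.25, 0.99, 1.58))
```

With these two prices the line sits at roughly 63% busy time. A GPU that works a quarter of each hour costs about $0.40 per hour on serverless against $0.99 dedicated.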

Part 3: The arithmetic of idle hours

"Is renting worth it?" comes down to one division.

Hourly rent ÷ API cost per job = the jobs per hour you must sustain before the rental starts to win.

Run the numbers: the cheapest verified machine on Vast.ai today at $0.36/hr, and a short job (a few hundred tokens of prompt, under a hundred back) at roughly $0.004 through a mid-tier API. That figure is the default in our own measuring script; swap in your own. The division says 90 jobs per hour, sustained. Fire fewer than that and per-token is simply cheaper. On our actual $0.756/hr machine the line sits near 190; on a $0.99/hr RunPod machine it moves up to roughly 250 jobs per hour.
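The same division, written as a minimal sketch so you can swap in your own numbers. The $0.004 per job is the default quoted above; the function name and the dictionary labels are ours.

```python
def breakeven_jobs_per_hour(hourly_rent: float, api_cost_per_job: float) -> float:
    """Jobs per hour that must be sustained before the rental starts to win."""
    if api_cost_per_job <= 0:
        raise ValueError("API cost per job must be positive")
    return hourly_rent / api_cost_per_job


API_COST_PER_JOB = 0.004  # USD; the article's default, replace with your own

hourly_rates = {
    "Vast.ai, cheapest verified": 0.36,
    "Vast.ai, our datacenter-grade machine": 0.756,
    "RunPod community pod": 0.99,
}

for name, rent in hourly_rates.items():
    line = breakeven_jobs_per_hour(rent, API_COST_PER_JOB)
    print(f"{name}: about {line:.0f} jobs per hour, sustained")
```

The result is only the point where the two bills match. Whether the machine can physically serve that many jobs per hour is a separate question that has to be measured.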

"Sustained" is the whole game. A workload that fires for one busy hour each morning and then goes quiet never reaches the line, because every quiet hour is rent with nothing to show. That is why the title says the expensive part is the idle hours. An hourly rental is not a cost per job; it is a cost per unit of time. Spread a few jobs across many hours and each job becomes expensive.
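To see how idle hours load themselves onto each job, divide the daily rent by the jobs actually done that day. This is a sketch under an invented workload: the 500 jobs per day is a made-up example, not our own figure, and the function name is hypothetical.

```python
def rental_cost_per_job(hourly_rent: float,
                        jobs_per_day: int,
                        hours_on_per_day: float = 24.0) -> float:
    """Rent paid per completed job when the machine is on for
    hours_on_per_day and completes jobs_per_day jobs in that time."""
    if jobs_per_day <= 0:
        raise ValueError("jobs_per_day must be positive")
    return hourly_rent * hours_on_per_day / jobs_per_day


API_COST_PER_JOB = 0.004  # USD, the default used in Part 3
JOBS_PER_DAY = 500        # hypothetical: one busy hour each morning

always_on = rental_cost_per_job(0.36, JOBS_PER_DAY)                        # on all day
one_hour = rental_cost_per_job(0.36, JOBS_PER_DAY, hours_on_per_day=1.0)   # on for the busy hour only

print(f"always on:     ${always_on:.4f} per job")
print(f"one hour only: ${one_hour:.4f} per job")
print(f"API:           ${API_COST_PER_JOB:.4f} per job")
```

In this example the always-on machine costs about $0.017 per job, more than four times the API price, even though the busy hour itself runs at 500 jobs per hour, well above the 90-job line.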

Before deciding, we wrote a short measuring script: fire real sample jobs at the machine and count. Seconds per job, jobs per hour, dollars per thousand jobs, against the same thousand through an API, and let the number decide, not the feeling that owning a GPU would be cool. We would recommend the same before you click rent. It takes under an hour to write and can save months of rent.
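A minimal sketch of what such a measuring script can look like. It is not our actual script. `run_job` is a hypothetical callable that you write yourself to send one prompt to your model server and wait for the answer. The loop runs jobs one at a time, so it will understate a server that batches concurrent requests, and the rental figure assumes the machine is kept busy for the whole hour.

```python
import time
from typing import Callable, Iterable


def measure(run_job: Callable[[str], object],
            prompts: Iterable[str],
            hourly_rent: float,
            api_cost_per_job: float) -> dict:
    """Time real sample jobs and compare rental vs API cost per 1,000 jobs."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one sample prompt")

    start = time.perf_counter()
    for prompt in prompts:
        run_job(prompt)  # blocks until the job has finished
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("jobs finished too fast to time")

    seconds_per_job = elapsed / len(prompts)
    jobs_per_hour = 3600.0 / seconds_per_job
    return {
        "jobs": len(prompts),
        "seconds_per_job": seconds_per_job,
        "jobs_per_hour": jobs_per_hour,
        "rental_usd_per_1k_jobs": hourly_rent / jobs_per_hour * 1000,
        "api_usd_per_1k_jobs": api_cost_per_job * 1000,
    }


def fake_job(prompt: str) -> str:
    """Stand-in for a real request so the sketch runs anywhere."""
    time.sleep(0.01)
    return "ok"


report = measure(fake_job, ["sample prompt"] * 5, hourly_rent=0.756, api_cost_per_job=0.004)
for key, value in report.items():
    print(f"{key}: {value}")
```

Replace `fake_job` with a function that calls your own endpoint, and feed it prompts sampled from real work rather than synthetic ones.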

The other way to move the break-even line down: switch the machine off when idle. Rental markets bill for hours the machine is on. If your work comes in a morning batch and an evening batch, running only those windows means paying only those windows, at the cost of a boot and a model load each time, which for a 27B-class model is minutes. Decide whether your work can wait that long.
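A rough way to put numbers on the switch-off habit, as a sketch under stated assumptions: billing stops completely while the machine is off (any storage or reservation charge is ignored), each window pays for one boot and model load, and the window lengths and five-minute boot time are invented for illustration.

```python
def monthly_bill(hourly_rent: float,
                 window_hours: list,
                 boot_minutes: float = 0.0,
                 days: int = 30) -> float:
    """Monthly rent when the machine runs only in the given daily windows.
    Each window is billed for its length plus one boot and model load."""
    billed_hours_per_day = sum(w + boot_minutes / 60.0 for w in window_hours)
    if billed_hours_per_day > 24.0:
        raise ValueError("windows plus boot time exceed 24 hours per day")
    return hourly_rent * billed_hours_per_day * days


RATE = 0.756  # USD per hour, the machine described in this post

always_on = monthly_bill(RATE, [24.0])
# Hypothetical schedule: a 2-hour morning batch and a 2-hour evening batch.
two_windows = monthly_bill(RATE, [2.0, 2.0], boot_minutes=5.0)

print(f"always on:   ${always_on:.2f} per month")
print(f"two windows: ${two_windows:.2f} per month")
```

Under these assumptions the always-on month costs $544.32 and the two-window month $94.50. The boot time shows up as ten billed minutes a day, plus the wait before each batch can start.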

Part 4: The line the price tag never shows: whose machine holds your data

Back to the question the table left open: why not just take Salad at $0.25?

Because the cheapness comes from the structure. Salad says it plainly: a network of 60,000+ consumer and data center GPUs. In human terms, most of it is strangers' gaming PCs, rented out while their owners are away. Your model spins up, your prompts flow through, on a machine in somebody's bedroom.

For work on data that is already public, that is a genuinely good deal. But if your prompts carry anything private, anything belonging to a customer, anything that hurts when it leaks, no price is low enough. This is not an accusation that anyone will steal your data. It is the same basic rule that keeps confidential documents away from an unknown copy shop, even at half price.

Which answers the question left open at the top: why our machine is not the cheapest row in the table. The work we send up carries data we do not want sitting on a stranger's machine, so we pay $0.756 for a datacenter-grade box instead of $0.36 for one that cannot answer that question.

The rule we actually use has one line: rate the job by how much leakage it can tolerate, then pick the machine tier to match. Public jobs can go to the cheapest tier with a clear conscience. Anything sensitive moves up a tier: verified or datacenter machines at minimum. Anything touching customer data belongs only on machines with a real contract behind them. Compare prices within a tier, never across tiers.
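The rule can be written down as a filter followed by a price comparison. This is an illustrative sketch: the tier and sensitivity names are ours, the tier assigned to each offer is our reading of the text above, and the "contracted host" row with its price is entirely hypothetical. Higher tiers are treated as acceptable for lower-sensitivity jobs, which is one interpretation of "at minimum".

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    CONSUMER = 1                # distributed consumer machines
    VERIFIED_OR_DATACENTER = 2
    CONTRACT = 3                # machines with a real contract behind them


class Sensitivity(IntEnum):
    PUBLIC = 1
    SENSITIVE = 2
    CUSTOMER = 3


# Minimum machine tier each kind of job is allowed to run on.
MIN_TIER = {
    Sensitivity.PUBLIC: Tier.CONSUMER,
    Sensitivity.SENSITIVE: Tier.VERIFIED_OR_DATACENTER,
    Sensitivity.CUSTOMER: Tier.CONTRACT,
}


@dataclass(frozen=True)
class Offer:
    name: str
    per_hour: float
    tier: Tier


def pick_offer(offers: list, sensitivity: Sensitivity) -> Optional[Offer]:
    """Filter by tier first, then take the cheapest of what is left."""
    allowed = [o for o in offers if o.tier >= MIN_TIER[sensitivity]]
    return min(allowed, key=lambda o: o.per_hour) if allowed else None


offers = [
    Offer("Salad", 0.25, Tier.CONSUMER),
    Offer("Vast.ai datacenter-grade", 0.756, Tier.VERIFIED_OR_DATACENTER),
    Offer("contracted host (hypothetical)", 1.20, Tier.CONTRACT),  # made-up row and price
]

for level in Sensitivity:
    choice = pick_offer(offers, level)
    print(level.name, "->", choice.name if choice else "no acceptable offer")
```

The order of operations is the point: price is never looked at until the tier filter has run, so a cheap machine in the wrong tier cannot win.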

And one advantage of self-hosting that price conversations forget: when the machine is yours, the data never travels to a model provider at all. For some workloads that is not a saving; it is the difference between possible and not.

Part 5: Verdict: which path, when

The rules we settled on

  1. Start per-token, always. Zero idle cost, no machine to babysit, until the numbers say it is time to move.
  2. Measure your workload before renting. A short script firing real jobs finds your own break-even line. Do not decide by feeling.
  3. Sustained volume + sensitive data = rent your own. Both conditions together. Without the first you are paying more than an API would cost; without the second you are probably early.
  4. Bursty work: look at serverless GPU. A higher rate, but you pay only for what runs. Budget the cold start honestly.
  5. Pick the machine tier by data sensitivity before you compare prices. Compare within a tier only.
  6. Switch it off when idle. The one habit that can halve the bill.

What we chose

Plainly: our main work still runs on subscriptions and per-token APIs, same as before. The rented GPU is a specialist tool for big, frequent batches where we want the data under our control, and we can switch it off the day the work dries up. If one sentence should leave with you: do not ask whether renting a GPU is cheap. Ask whether your workload can keep it fed all day.
