Yesterday we rented our first GPU: a single RTX 5090 on Vast.ai, a datacenter-grade machine in the EU at $0.756 an hour (the market has offers at half that; we will get to why we paid more). The plan is to run a 27B open-weight model (a model whose weights you can download and run yourself) through vLLM for large batches of short jobs. Booking took under five minutes. Machine up, model serving, answers coming back.
What took far longer was the question before it: should we rent a machine at all, when the same model is available per token through an API, with no machine to babysit and nothing billed while idle? This post is the homework we did to decide. Every price was checked live on July 3, 2026, against each provider's pricing page or live marketplace, and every formula is one we actually use on our own workload.
Part 1. Three ways to pay for the same model
The same open-weight model, say a 27B Qwen, can be paid for in three ways, and the three bill on entirely different logic.
First, per token. An inference provider hosts the model on their hardware; you call an API and pay for what you use. The decisive advantage: your idle hours cost exactly zero. A night with no work is a night with no bill. The trade: every prompt travels through someone else's machine, and the model menu is theirs, not yours.
Second, rent a GPU and run it yourself. You get the whole machine: any model, any settings, full control. The price is the harshest fact of this path: the meter runs every hour, work or no work. A machine left on for a month at $0.36/hr costs roughly $260 before it does a single useful thing.
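The monthly figure is a single multiplication, and it is worth seeing how little the meter cares about work. A minimal Python sketch, assuming an average month of 730 hours (24 x 365 / 12); `monthly_rent` is a name we made up for illustration:

```python
# Average hours in a month: 24 h x 365 days / 12 months.
HOURS_PER_MONTH = 24 * 365 / 12  # 730.0


def monthly_rent(hourly_rate: float, hours_on: float = HOURS_PER_MONTH) -> float:
    """Dollars billed for a rental left on for `hours_on` hours, busy or idle."""
    return hourly_rate * hours_on


if __name__ == "__main__":
    # Rates from the comparison table; nothing here depends on jobs run.
    for rate in (0.36, 0.756, 0.99):
        print(f"${rate}/hr left on all month: ${monthly_rent(rate):,.0f}")
```

A 30-day month (720 hours) gives $259 at $0.36/hr and the 730-hour average gives $263; either way it is roughly $260.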
Third, serverless GPU. The middle path. The provider wakes a machine when a request arrives and bills only the seconds it runs. The rate is higher than a dedicated rental (real numbers in the next section), and there is a cold start: the wait while the machine wakes and the model loads. But the idle-hours problem is gone entirely.
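Whether the higher serverless rate pays off depends on how many hours a day the machine is actually busy. A minimal Python sketch comparing a dedicated pod with a serverless worker at the RunPod rates from the table below ($0.99/hr and $1.58/hr); the function names are ours, and we assume, conservatively, that cold-start seconds are billed at the serverless rate:

```python
def dedicated_daily_cost(rate_per_hr: float) -> float:
    """A dedicated rental bills all 24 hours, busy or idle."""
    return rate_per_hr * 24


def serverless_daily_cost(rate_per_hr: float, busy_hours: float,
                          cold_starts: int = 0, cold_start_s: float = 0.0) -> float:
    """Serverless bills running time only.

    ASSUMPTION: each cold start is billed at the same rate as normal running time.
    """
    billed_hours = busy_hours + cold_starts * cold_start_s / 3600
    return rate_per_hr * billed_hours


def serverless_break_even_hours(dedicated_rate: float, serverless_rate: float) -> float:
    """Busy hours per day below which serverless is the cheaper option."""
    return dedicated_daily_cost(dedicated_rate) / serverless_rate


if __name__ == "__main__":
    hours = serverless_break_even_hours(0.99, 1.58)
    print(f"Serverless is cheaper below about {hours:.1f} busy hours per day")
    # Hypothetical day: 2 busy hours and 4 cold starts of 90 seconds each.
    cost = serverless_daily_cost(1.58, 2, cold_starts=4, cold_start_s=90)
    print(f"Bursty day on serverless: ${cost:.2f}")
```

With these two rates the line lands near 15 busy hours a day; below that, paying the higher rate only while running beats paying the lower rate around the clock.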
Notice that "cheap" and "expensive" have not appeared yet. These three cannot be compared head-on until you know the rhythm of your own workload. Steady all-day traffic and bursty once-a-day batches give opposite answers.
Part 2. GPU rental prices today, three providers compared
The table below is for an RTX 5090 (32GB of VRAM, enough for a quantized 27B on a single card), checked live on July 3, 2026 against each provider's official pricing page or live marketplace.
| Provider | RTX 5090 per hour | Whose machines | Best for |
|---|---|---|---|
| Salad | $0.25 | A distributed network of 60,000+ GPUs, mostly consumer machines owned by individuals | Public-data workloads that can tolerate a leak; unbeatable on price |
| Vast.ai | from $0.36 (verified machines) | An auction marketplace: small hosts and datacenters side by side, machine grades you pick, prices that move with supply | General work at a good price, if you will spend time picking a machine. This is where we rent (we picked a $0.756 datacenter-grade machine, not the cheapest) |
| RunPod | $0.99 (community pod) | Datacenters; stable prices, real support | Work where the machine vanishing mid-job is not acceptable |
| RunPod serverless | $1.58 (only while running) | Same datacenters, but machines wake per request | Bursty workloads that should not pay for idle hours |
Do not rush to crown Salad from this table. It is roughly a quarter of RunPod's price, yes, but the third column matters more than the second, and we will come back to it in Part 4.
Vast's number deserves a footnote: it is an auction market, not a fixed price list. $0.36 was the cheapest verified machine (one that passed the platform's checks) at the moment we looked. Accept unverified machines and you will find cheaper; look tomorrow and the number may have moved. That is the nature of the market. The good price comes with homework: picking the machine, checking its bandwidth, and accepting that a small host can disappear on you.
Part 3. The arithmetic of idle hours
"Is renting worth it?" comes down to one division.
Hourly rent ÷ API cost per job = the jobs per hour you must sustain before the rental starts to win.
Run the numbers: the cheapest verified market GPU today at $0.36/hr, and a short job (a few hundred tokens of prompt, under a hundred back) at roughly $0.004 through a mid-tier API. That figure is the default in our own measuring script; swap in your own. The division says 90 jobs per hour, sustained. Fire fewer than that and per-token is simply cheaper. On our actual $0.756/hr machine the line sits near 190; on a $0.99/hr RunPod machine it moves up to roughly 250 jobs per hour.
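The division is small enough to keep next to whatever measuring code you run. A minimal Python sketch; `break_even_jobs_per_hour` is a hypothetical name, and $0.004 per job is the default figure quoted above, not a universal price:

```python
def break_even_jobs_per_hour(hourly_rent: float, api_cost_per_job: float) -> float:
    """Jobs per hour a rental must sustain before it beats per-token pricing."""
    if api_cost_per_job <= 0:
        raise ValueError("api_cost_per_job must be positive")
    return hourly_rent / api_cost_per_job


if __name__ == "__main__":
    API_COST_PER_JOB = 0.004  # default for a short job; swap in your own
    for label, rent in [("cheapest verified Vast.ai", 0.36),
                        ("our Vast.ai machine", 0.756),
                        ("RunPod pod", 0.99)]:
        line = break_even_jobs_per_hour(rent, API_COST_PER_JOB)
        print(f"{label} at ${rent}/hr: break-even near {line:.0f} jobs/hour")
```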
"Sustained" is the whole game. A workload that fires for one busy hour each morning and then goes quiet never reaches the line, because every quiet hour is rent with nothing to show. That is why the title says the expensive part is the idle hours. An hourly rental is not a cost per job; it is a cost per unit of time. Spread a few jobs across many hours and each job becomes expensive.
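To make "cost per unit of time" concrete, divide a day of rent by the jobs that day actually carried. A minimal Python sketch; the function name and both workloads (200 jobs in one morning hour versus 200 jobs every hour) are hypothetical, chosen only to contrast burst and sustained rhythms at the $0.36/hr rate:

```python
def rental_cost_per_job(hourly_rent: float, jobs_per_day: int,
                        hours_on: float = 24) -> float:
    """Effective rental cost of one job when the machine stays on `hours_on` hours."""
    return hourly_rent * hours_on / jobs_per_day


if __name__ == "__main__":
    RENT = 0.36              # cheapest verified RTX 5090 from the table, $/hr
    API_COST_PER_JOB = 0.004
    # Same busy-hour rate, two rhythms: one morning burst vs. all day long.
    burst = rental_cost_per_job(RENT, jobs_per_day=200)
    sustained = rental_cost_per_job(RENT, jobs_per_day=200 * 24)
    print(f"burst:     ${burst:.4f} per job vs ${API_COST_PER_JOB} via API")
    print(f"sustained: ${sustained:.4f} per job vs ${API_COST_PER_JOB} via API")
```

The burst day costs more than ten times the API price per job; the sustained day, on the same machine at the same rate, comes in under half of it.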
Before deciding, we wrote a short measuring script: fire real sample jobs at the machine and count. Seconds per job, jobs per hour, dollars per thousand jobs, against the same thousand through an API, and let the number decide, not the feeling that owning a GPU would be cool. We would recommend the same before you click rent. It takes under an hour to write and can save months of rent.
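Our own script is not reproduced here; what follows is a minimal Python sketch of the same idea, under stated assumptions. `run_job` is a hypothetical callable you supply (in practice, one request to the model server; vLLM can expose an OpenAI-compatible HTTP endpoint). The thread pool stands in for the concurrent requests a batching server is built to absorb, and seconds per job is wall-clock time divided by jobs, so it already reflects that concurrency:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def measure(run_job: Callable[[str], str], prompts: Sequence[str],
            hourly_rent: float, api_cost_per_job: float,
            concurrency: int = 8) -> Dict[str, float]:
    """Fire real sample jobs, time them, and price them both ways."""
    if not prompts:
        raise ValueError("need at least one sample prompt")
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(run_job, prompts))  # list() waits for every job to finish
    elapsed = time.perf_counter() - start
    jobs_per_hour = len(prompts) / elapsed * 3600
    return {
        "seconds_per_job": elapsed / len(prompts),  # effective, with concurrency
        "jobs_per_hour": jobs_per_hour,
        # Assumes the machine stays this busy for every hour it is rented.
        "rental_per_1000_jobs": hourly_rent / jobs_per_hour * 1000,
        "api_per_1000_jobs": api_cost_per_job * 1000,
    }


if __name__ == "__main__":
    def fake_job(prompt: str) -> str:
        """Stand-in for a real request to the model server."""
        time.sleep(0.05)
        return "ok"

    report = measure(fake_job, ["sample prompt"] * 40,
                     hourly_rent=0.756, api_cost_per_job=0.004)
    for key, value in report.items():
        print(f"{key:>22}: {value:,.4f}")
```

If `rental_per_1000_jobs` comes in under `api_per_1000_jobs`, the machine can beat the API, but only for the hours you actually keep it that busy.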
The other way to move the break-even line down: switch the machine off when idle. Rental markets bill for hours the machine is on. If your work comes in a morning batch and an evening batch, running only those windows means paying only those windows, at the cost of a boot and a model load each time, which for a 27B-class model is minutes. Decide whether your work can wait that long.
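The saving from work windows is easy to estimate before you try it. A minimal Python sketch; the function name, the two 1.5-hour batches, and the 10-minute boot-and-load time are hypothetical placeholders (measure your own boot time), and we assume boot and model load are billed like any other on-time:

```python
def windowed_daily_bill(hourly_rent: float, windows: int,
                        work_hours_per_window: float, boot_minutes: float) -> float:
    """Daily bill when the machine runs only in work windows.

    Each window also pays for boot and model load, billed as on-time.
    """
    billed_hours = windows * (work_hours_per_window + boot_minutes / 60)
    return hourly_rent * billed_hours


if __name__ == "__main__":
    RENT = 0.756  # our Vast.ai rate, $/hr
    # Hypothetical schedule: a morning and an evening batch, 10 min boot each.
    windowed = windowed_daily_bill(RENT, windows=2,
                                   work_hours_per_window=1.5, boot_minutes=10)
    always_on = RENT * 24
    print(f"windows only: ${windowed:.2f}/day   always on: ${always_on:.2f}/day")
```

The boot overhead is the price of the saving: the longer the model takes to load, the fewer windows per day make sense.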
Part 4. The line the price tag never shows: whose machine holds your data
Back to the question the table left open: why not just take Salad at $0.25?
Because the cheapness comes from the structure. Salad says it plainly: a network of 60,000+ consumer and data center GPUs. In human terms, most of it is strangers' gaming PCs, rented out while their owners are away. Your model spins up, your prompts flow through, on a machine in somebody's bedroom.
For work on data that is already public, that is a genuinely good deal. But if your prompts carry anything private, anything belonging to a customer, anything that hurts when it leaks, no price is low enough. This is not an accusation that anyone will steal your data. It is the same basic rule that keeps confidential documents away from an unknown copy shop, even at half price.
Which answers the question left open at the top: why our machine is not the cheapest row in the table. The work we send up carries data we do not want sitting on a stranger's machine, so we pay $0.756 for a datacenter-grade box instead of $0.36 for one that cannot answer that question.
The rule we actually use has one line: rate the job by how much leakage it can tolerate, then pick the machine tier to match. Public jobs can go to the cheapest tier with a clear conscience. Anything sensitive moves up a tier: verified or datacenter machines at minimum. Anything touching customer data belongs only on machines with a real contract behind them. Compare prices within a tier, never across tiers.
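The rule reads as a filter followed by a price sort. A minimal Python sketch; the tier names, their ordering, and the `MIN_TIER` mapping are our own encoding of the rule, and the offers in the usage are the table's prices slotted into tiers for illustration only:

```python
from typing import List, Optional, Tuple

# Hypothetical tier labels, ordered from least to most trustworthy.
TIERS = ["consumer", "verified", "datacenter", "contract"]

# Minimum acceptable tier per data class, following the one-line rule.
MIN_TIER = {"public": "consumer", "sensitive": "verified", "customer": "contract"}

Offer = Tuple[str, str, float]  # (name, tier, dollars per hour)


def pick_offer(data_class: str, offers: List[Offer]) -> Optional[Offer]:
    """Drop every offer below the job's minimum tier, then take the cheapest."""
    floor = TIERS.index(MIN_TIER[data_class])
    eligible = [o for o in offers if TIERS.index(o[1]) >= floor]
    return min(eligible, key=lambda o: o[2], default=None)


if __name__ == "__main__":
    offers = [("Salad", "consumer", 0.25),
              ("Vast.ai verified", "verified", 0.36),
              ("Vast.ai datacenter", "datacenter", 0.756),
              ("RunPod pod", "datacenter", 0.99)]
    for data_class in MIN_TIER:
        print(data_class, "->", pick_offer(data_class, offers))
```

Filtering first is what keeps the comparison honest: a cheaper offer from a tier the job cannot tolerate never enters the `min` at all. No offer in the table carries a contract, so customer data gets `None`, which is the right answer.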
And one advantage of self-hosting that price conversations forget: when the machine is yours, the data never travels to a model provider at all. For some workloads that is not a saving; it is the difference between possible and not.
Part 5. Verdict: which path, when
The rules we settled on
- Start per-token, always. Zero idle cost, no machine to babysit, until the numbers say it is time to move.
- Measure your workload before renting. A short script firing real jobs finds your own break-even line. Do not decide by feeling.
- Sustained volume + sensitive data = rent your own. Both conditions together. Without the first you are overpaying an API; without the second you are probably early.
- Bursty work: look at serverless GPU. A higher rate, but you pay only for what runs. Budget the cold start honestly.
- Pick the machine tier by data sensitivity before you compare prices. Compare within a tier only.
- Switch it off when idle. The one habit that can halve the bill.
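The rules above can be condensed into one small function to run against your measured numbers. A minimal Python sketch; `choose_path` and its argument names are hypothetical, the ordering of the checks is our reading of the rules, and it is a starting point for the decision rather than the decision itself:

```python
def choose_path(sustained_jobs_per_hour: float, break_even_jobs_per_hour: float,
                sensitive_data: bool, bursty: bool) -> str:
    """Our reading of the rules above; a starting point, not a verdict."""
    # Both conditions together: sustained volume past the line AND sensitive data.
    if sustained_jobs_per_hour >= break_even_jobs_per_hour and sensitive_data:
        return "rent your own GPU (pick the tier by data sensitivity first)"
    if bursty:
        return "look at serverless GPU (budget the cold start)"
    return "stay per-token"


if __name__ == "__main__":
    # Hypothetical workloads against a 90 jobs/hour break-even line.
    print(choose_path(300, 90, sensitive_data=True, bursty=False))
    print(choose_path(300, 90, sensitive_data=False, bursty=False))
    print(choose_path(20, 90, sensitive_data=True, bursty=True))
```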
What we chose
Plainly: our main work still runs on subscriptions and per-token APIs, same as before. The rented GPU is a specialist tool for big, frequent batches where we want the data under our control, and we can switch it off the day the work dries up. If one sentence should leave with you: do not ask whether renting a GPU is cheap. Ask whether your workload can keep it fed all day.
- All prices checked July 3, 2026: RunPod pricing (RTX 5090 pod $0.99/hr, 4090 $0.69/hr, serverless 5090 $1.58/hr) · Salad pricing (RTX 5090 $0.25/hr, 4090 $0.16/hr, and their own wording "60,000+ consumer and data center GPUs") · Vast.ai checked against the live marketplace via their public API (cheapest verified 5090 at $0.361/hr at check time; auction prices move with supply)
- The rate we actually pay on Vast.ai is $0.756/hr for a single RTX 5090 on a secure datacenter-grade machine in the EU; $0.36 was the cheapest verified market offer at check time. The break-even formula and the $0.004-per-job figure come from the measuring script we wrote for our own workload
- vLLM (official documentation)