What is vLLM and what is it for?

vLLM is an open-source model server built for fast inference and efficient memory use. Start it and you get an OpenAI-compatible API on your own machine, so calling code barely changes: point the base URL at your box instead of a provider.

How does a 27B model fit on a single GPU?

Through quantization: compressing the model's weights into a smaller number format. NVIDIA's NVFP4 format shrinks a 27B model to roughly 20GB of weights, which fits the 32GB of VRAM on a single RTX 5090 with room left for a 32K context when paired with an fp8 KV cache. The trade: NVFP4 needs a Blackwell-generation card or newer.

Why does vLLM hit out-of-memory on a freshly rented machine?

The case we actually hit: the rental's template image had an old model process still running, holding 28GB of VRAM, so the new model had nowhere to load. Always run nvidia-smi first to confirm the VRAM is really free. If a stray process is there, kill it AND remove the config entry that starts it at boot, or it comes back on the next restart.

Do installed files survive stopping a rented GPU?

Only what lives on the persistent volume. Most GPU rentals distinguish stop (GPU billing pauses, persistent disk kept) from destroy (everything wiped). Anything under the persistent mount, such as /workspace, survives a stop; anything outside it may not. So put both the Python environment and the ~20GB of model weights under the persistent volume from the first command.

vLLM on a Rented GPU: Running a 27B Model, and the Traps

The previous post walked through the cost equation of renting a GPU for your own LLM and ended with a decision: a datacenter-grade RTX 5090 at $0.756 an hour. This post is the doing half: putting a 27B Qwen on that machine with vLLM, from clicking rent to the first real answer coming back.

The core steps are well documented everywhere. What is rarely written down is where it actually breaks along the way, and those breaks cost us several times longer than the install itself. So this is told as: what we did, what we hit, how we fixed it, through all four traps, so you can step over the holes we fell into.

Terms used in this post, all in one place

template: the ready-made image a rental boots with; base services come pre-configured
vLLM: the open-source model server; start it and you get an OpenAI-compatible API on your own box
quantization / NVFP4: compressing model weights into a smaller number format; NVFP4 is NVIDIA's, and needs a Blackwell-generation card
VRAM: the GPU's own memory, where the whole model must sit to run
KV cache: the memory a model uses to hold the conversation in flight; longer context eats more, and the fp8 format roughly halves it
supervisor: the template's caretaker process that starts and watches services; you work through it, not around it
/workspace: the rental's persistent volume; things inside survive a stop, things outside may not
SSH tunnel: an encrypted pipe that makes a port on a far machine appear as localhost on yours, with nothing opened to the internet

Part 1What you need before starting

The goal is a 27B model on one card, so the memory arithmetic has to close first. A full-precision 27B does not fit in 32GB of VRAM. But NVIDIA's NVFP4-quantized build (we used Qwen3.6-27B-NVFP4 from Hugging Face) weighs roughly 20GB, which sits comfortably on an RTX 5090 with room left for an fp8 KV cache and a 32K context.

The format brings its own conditions: NVFP4 needs a Blackwell-generation card (RTX 50xx), a recent CUDA, and a vLLM version that supports it (mid-2026, when we installed, that meant vLLM 0.8 or newer). So when picking a rental, filter on card generation and CUDA version, not just price. How to think about which provider and which machine tier is the whole previous post.

The pre-rent checklist, condensed: a Blackwell card with 32GB+ of VRAM, CUDA 12.8+, disk for ~20GB of weights plus environments, and a machine with a persistent volume (Trap 3 is why).

Part 2Trap 1: the "empty" machine that is not empty

Our rental came with a ready-made LLM template, which is genuinely convenient. But the first attempt to load the 27B hit out-of-memory immediately, on a machine we had rented minutes earlier.

nvidia-smi explained it: an old model process from the template was still running, holding 28GB of the 32GB of VRAM. The template auto-starts a small model as a freebie. The machine we thought was empty had never been empty, so our model had nowhere to sit.

The fix has two halves. Killing the process is the obvious half. The half that matters more: trace and remove the config entry that starts it at boot, or every restart resurrects it and it eats the card again. On our machine that entry lived in the system environment file the supervisor reads when bringing services up.

The lesson that transfers to any rental: the first command on a new machine is always nvidia-smi. Before installing anything, confirm the VRAM is actually free and see what is already sitting on the card. Five minutes here saves hours of wondering why you are OOM.

Part 3Trap 2: config lives in two places, and the template has a caretaker

The first instinct of anyone comfortable on Linux is to edit the script and run it yourself. A template machine does not work that way. It has a supervisor, a caretaker that starts and watches every service. Run vLLM by hand on the side and the supervisor happily starts its own copy too, and the two fight over the card.

The sneaky part is that the config is split across two files: the model name lives in the system environment file, while the vLLM arguments live in a separate file the supervisor reads. Edit one and wonder why nothing changed: that is the hole we fell into once.

The right move is to read the template's own guide before touching anything, then work through the caretaker instead of fighting it. Our machine shipped with a guide file on disk saying exactly which setting lives where. Once read, everything was straightforward: point both files at the new model, tell the supervisor to restart the service, done. No duplicate processes, no surprises at boot.

Part 4Trap 3: things vanish when you stop the machine

The previous post said the one habit that can halve the bill is switching the machine off when idle. On a rental, switching off has fine print: stop and destroy are different things. Stop pauses the GPU billing but keeps the persistent disk; destroy wipes the slate. And the line that matters: only what lives on the persistent volume (on our machine, /workspace) survives a stop.

In practice: both the Python environment and the ~20GB of model weights must live under /workspace from the very first command. Install them in the normal home directory and everything works fine on day one; then you stop the machine overnight to save money, and the morning greets you with a blank box and a 20GB re-download, paying rent while you wait.

With everything in the right place, our stop-when-idle cycle costs only the few minutes of loading the model back onto the card. No reinstalls, no re-downloads, and the rent runs only while the machine is actually working, exactly as the previous post's break-even math wants it.

Part 5Trap 4: the tunnel home drops, and must reconnect itself

The model runs, but it runs on a machine half a world away. The next question: how does the system at home call it without exposing the API port to the open internet? Our answer is an SSH tunnel: the vLLM port on the rental appears as localhost on the home-side machine, so calling code sees the model as if it lived in the house. (The full thinking behind tunnels like this is its own post.)

The tunnel's trap is that it drops silently. One network hiccup and the pipe is gone, and everything depending on it goes quiet with it. What closed the issue for good was handing the tunnel to systemd: a service unit with restart-always, so whenever the pipe breaks it reconnects itself within seconds, no human on watch.

One more souvenir from the field: when driving the rental through nested SSH (home into server, server into rental), stacking the command inside multiple layers of quotes can swallow the output silently. The command seems to run; nothing comes back. We burned several rounds before switching to piping the script into stdin instead of nesting quotes. If a nested command returns emptiness, do not guess: change how you send it.

Part 6What "done" actually means: verify before declaring

"Installed" has many layers, and only one layer is believable: a real chat completion fired through the tunnel, with a real answer coming back. Process up, port open, logs clean: none of those count, because we have watched each of them look green while the real thing was still broken.

Check the model on the card. nvidia-smi should show a single vLLM process holding roughly the VRAM you calculated.
Check the API at the rental's end. Fire a short completion at the rental's own localhost; real tokens must come back.
Check through the tunnel from home. Same question, fired at home-side localhost through the tunnel; same answer must come back.
Check the stop/start cycle. Stop the machine, start it again, and re-run the three checks above. Everything must come back on its own, no hands.

Pass all four and it is done. The numbers from our machine after passing: the 27B answers short jobs steadily, rent runs only while the machine is on, and the home system sees nothing but a localhost endpoint, as if the model lived in the house.

Closing on the usual single principle: the install is the easy part; the real skill is knowing a rented machine is never a blank machine. It has leftovers, it has a caretaker, and it has rules about what survives a stop. Read the machine first, and everything else is straightforward.

The hands-on edition

Want the word-for-word install runbook?

The full setup script, the vllm serve command with every flag explained (why fp8, why 0.55), a copy-and-edit systemd unit for the SSH tunnel, and the four-step verify checklist. Enter your email and it opens right away.

Sources & references

Every event and number here (the 28GB orphan process, ~20GB of weights, 32K context, the $0.756/hr machine) comes from our own install on July 2, 2026, recorded in internal session logs
nvidia/Qwen3.6-27B-NVFP4 (Hugging Face model card)
vLLM: quantization documentation (official)

A 27B model on a rented GPU with vLLM:
the traps are where nobody writes them down