Datacenter cards like the H100 dominate the headlines, but most people running a local LLM do not need 80 GB of HBM. A single consumer card - an RTX 4090, 5090, or even a used 3090 - runs a quantized 7B-34B model perfectly well, and you can rent one in the cloud for less than the price of a coffee per hour.
This post covers what consumer GPUs cost across the clouds we track, what actually fits in their VRAM, where they fall short, and when renting beats buying.
Why rent a consumer GPU instead of buying
A new RTX 4090 runs about $1,600 and a 5090 is well north of that. Renting one in the cloud makes sense when:
- You want to try before you buy. Benchmark your model and toolchain on the exact card you are considering, for a couple of dollars, before committing four figures to hardware.
- The workload is bursty. If you fine-tune a LoRA over a weekend and then sit idle for two weeks, paying by the hour beats a card depreciating on your desk.
- You do not want the heat, noise, or power draw. A 4090 is rated at about 450W and a 5090 at about 575W. In a home office that is a space heater and a fan that never stops.
- You need a clean, reproducible box. No driver conflicts with your daily-driver machine, no CUDA version juggling.
Cheapest cloud consumer GPUs right now
Here is the cheapest on-demand price per provider for the three most common consumer cards, pulled live from our database:
| Provider | RTX 3090 (24GB) | RTX 4090 (24GB) | RTX 5090 (32GB) |
|---|---|---|---|
| Vast | $0.25 | $0.40 | $0.44 |
| Theta EdgeCloud | $0.10 | $0.53 | $0.64 |
| Shadeform | -- | $0.60 | $0.65 |
| RunPod | $0.46 | $0.69 | $0.99 |
| Yotta | -- | -- | $0.65 |
A few things worth reading off that table. Dependable on-demand pricing from a managed provider like RunPod sits around $0.69/hr for a 4090 and $0.99/hr for a 5090. The lowest rates are well below that: Vast.ai lists 4090s from $0.40/hr and 5090s from $0.44/hr, and 3090s run $0.10-0.25/hr on Theta EdgeCloud and Vast.ai. Marketplace providers such as Vast.ai can go this low because you are renting spare capacity from individual hosts. That pricing is variable and host reliability differs, so it is best for interruptible or experimental work rather than anything you need to stay up.
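To price your own job length, the table is small enough to drop into a script. The sketch below hard-codes the snapshot shown above (the real numbers drift as the data refreshes), picks the cheapest listed provider per card, and prices a hypothetical 20-hour weekend LoRA run. `PRICES`, `cheapest`, and `job_cost` are illustrative names, not part of any provider's API.

```python
from typing import Dict, Optional, Tuple

# Cheapest on-demand $/hr per provider, copied from the table above.
# This is a snapshot; None marks a card the provider does not list.
PRICES: Dict[str, Dict[str, Optional[float]]] = {
    "Vast":            {"RTX 3090": 0.25, "RTX 4090": 0.40, "RTX 5090": 0.44},
    "Theta EdgeCloud": {"RTX 3090": 0.10, "RTX 4090": 0.53, "RTX 5090": 0.64},
    "Shadeform":       {"RTX 3090": None, "RTX 4090": 0.60, "RTX 5090": 0.65},
    "RunPod":          {"RTX 3090": 0.46, "RTX 4090": 0.69, "RTX 5090": 0.99},
    "Yotta":           {"RTX 3090": None, "RTX 4090": None, "RTX 5090": 0.65},
}


def cheapest(card: str) -> Tuple[str, float]:
    """Return (provider, $/hr) for the lowest listed price for `card`."""
    offers = [(name, row[card]) for name, row in PRICES.items() if row[card] is not None]
    return min(offers, key=lambda offer: offer[1])


def job_cost(card: str, hours: float, provider: Optional[str] = None) -> float:
    """Dollar cost of `hours` on `card`, at a named provider or the cheapest one."""
    rate = PRICES[provider][card] if provider else cheapest(card)[1]
    if rate is None:
        raise ValueError(f"{provider} does not list {card}")
    return round(rate * hours, 2)


if __name__ == "__main__":
    for card in ("RTX 3090", "RTX 4090", "RTX 5090"):
        print(card, cheapest(card))
    # Hypothetical 20-hour weekend LoRA run on a 4090: cheapest listing vs RunPod.
    print(job_cost("RTX 4090", 20), job_cost("RTX 4090", 20, "RunPod"))
```

Keep in mind that the cheapest row is usually a marketplace listing, so the lower number buys you less reliability, not just a lower bill.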
For a curated shortlist of consumer-GPU options, see our consumer GPU picks.
VRAM reality: what fits in 24 GB and 32 GB
VRAM is the constraint that decides everything for local inference. The numbers:
- RTX 3090 - 24 GB
- RTX 4090 - 24 GB
- RTX 5090 - 32 GB
The model weights are the floor; you also need room for the KV cache (which grows with context length and batch size) and activations. Rough guidance for a 24 GB card:
- ~7B-8B models in fp16 (Llama 3 8B, Mistral 7B) fit with room for a modest context window. Weights alone are ~14-16 GB.
- ~13B models in 8-bit quantization fit comfortably; ~13B in fp16 (~26 GB) does not.
- ~30B-34B models in 4-bit (GPTQ/AWQ, Q4 GGUF) fit - a 34B at 4-bit is roughly 18-20 GB of weights, leaving a few gigabytes for context.
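The rules of thumb above come from one piece of arithmetic: parameters times bits per weight, divided by 8. Here is a minimal sketch, assuming about 4.5 effective bits per weight for 4-bit formats (they also store quantization scales; real files vary) and a hypothetical 2 GB reserve for KV cache, activations, and the CUDA context. `weight_gb` and `fits` are illustrative helpers, not a library API.

```python
from typing import Dict

# Effective bits per weight. fp16 and int8 are exact; "q4" is an assumption:
# 4-bit GGUF/GPTQ/AWQ files also store scales, so ~4.5 bits is a rough average.
BITS_PER_WEIGHT: Dict[str, float] = {"fp16": 16.0, "int8": 8.0, "q4": 4.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight memory in GB (1e9 bytes): params x bits / 8."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


def fits(params_billion: float, fmt: str, vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if weights plus a reserve for KV cache, activations and CUDA context fit in VRAM."""
    return weight_gb(params_billion, fmt) + reserve_gb <= vram_gb


if __name__ == "__main__":
    cases = [("8B fp16", 8, "fp16"), ("13B int8", 13, "int8"), ("13B fp16", 13, "fp16"),
             ("34B Q4", 34, "q4"), ("70B Q4", 70, "q4")]
    for label, params, fmt in cases:
        print(f"{label}: ~{weight_gb(params, fmt):.1f} GB weights, "
              f"fits 24 GB: {fits(params, fmt, 24)}, fits 32 GB: {fits(params, fmt, 32)}")
```

With these assumptions a 34B at 4-bit comes out near 19 GB and a 70B near 39 GB, in line with the figures quoted in this section. Raise `reserve_gb` if you plan on long contexts.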
The 5090's extra 8 GB (32 GB total) does not unlock a whole new model class on its own, but it buys real headroom: longer context windows on the same model, larger batch sizes for higher throughput, or running a 4-bit ~34B with comfortable KV-cache space instead of scraping the ceiling. It is also meaningfully faster per token, thanks to the newer Blackwell architecture and roughly 1.8 TB/s of memory bandwidth versus about 1.0 TB/s on the 4090.
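To see what that extra 8 GB buys, size the KV cache: every layer stores one key and one value vector per KV head, per token, per sequence. The sketch below assumes Llama 3 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache. Other models and quantized KV caches give different numbers, and `kv_cache_gib` is a hypothetical helper.

```python
# KV-cache size for a decoder-only transformer: every layer stores one key and
# one value vector per KV head, per token, per sequence in the batch.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, context_tokens: int,
                 batch: int = 1, bytes_per_elem: int = 2) -> float:
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * batch * bytes_per_elem
    return total_bytes / 2**30  # bytes -> GiB


if __name__ == "__main__":
    # Assumed shape: Llama 3 8B (32 layers, 8 KV heads, head_dim 128), fp16 cache.
    per_seq = kv_cache_gib(32, 8, 128, context_tokens=8192)
    print(f"{per_seq:.2f} GiB per 8k-token sequence")
    # The 5090's extra 8 GB over a 24 GB card, spent entirely on KV cache:
    print(f"extra 8k-token sequences: {int(8 // per_seq)}")
```

Under these assumptions each 8k-token sequence costs about 1 GiB of cache, so the 5090's extra memory holds roughly eight more concurrent sequences, or equivalently a much longer context for one.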
If you need to run a 70B model, a single consumer card will not do it even at 4-bit (a 70B at Q4 is ~40 GB). That is datacenter territory - see the section below.
The catch: no NVLink
Consumer cards lack the interconnect that datacenter GPUs rely on. The RTX 4090 and 5090 have no NVLink at all, and the 3090's NVLink bridge links only two cards, so multi-GPU setups built from consumer cards mostly talk to each other over PCIe, a comparatively slow bus. That makes consumer GPUs excellent for single-GPU local inference and light fine-tuning (LoRA/QLoRA), but a poor fit for multi-GPU tensor-parallel training or multi-node clusters, where the lack of fast inter-GPU interconnect becomes the bottleneck. If your plan involves sharding a model across several GPUs, you want NVLink-equipped datacenter cards instead.
In practice this means: rent one consumer card, run one model on it, and you will have a great experience. Try to stitch four 4090s together to train something large and you will spend most of your time waiting on PCIe transfers.
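A back-of-envelope model shows why the bus matters. In a ring all-reduce, the collective typically used to sync gradients in data-parallel training, each GPU sends and receives about 2(n-1)/n times the payload. The bandwidth figures below are assumptions for illustration, not measurements: about 25 GB/s effective per direction for PCIe 4.0 x16 (the 4090's link; the 5090 supports PCIe 5.0, which roughly doubles it) and about 450 GB/s per direction for H100 SXM NVLink. The 2 GB payload is a hypothetical stand-in for fp16 gradients of a ~1B-parameter model.

```python
# Bandwidth-only model of a ring all-reduce. Each GPU sends and receives
# 2 * (n - 1) / n times the payload, so time ~= that volume / link bandwidth.
# Latency, compute overlap, and protocol overhead are ignored.
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    if n_gpus < 2:
        return 0.0  # nothing to exchange on a single GPU
    volume_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return volume_gb / link_gb_per_s


# Assumed per-direction bandwidths, not measurements:
PCIE4_X16_GB_S = 25.0     # effective; PCIe 4.0 x16 peaks near 32 GB/s per direction
H100_NVLINK_GB_S = 450.0  # H100 SXM NVLink: 900 GB/s total, ~450 GB/s each way

if __name__ == "__main__":
    payload = 2.0  # GB: hypothetical fp16 gradients of a ~1B-parameter model
    pcie = ring_allreduce_seconds(payload, 4, PCIE4_X16_GB_S)
    nvlink = ring_allreduce_seconds(payload, 4, H100_NVLINK_GB_S)
    print(f"PCIe: {pcie:.3f} s/step, NVLink: {nvlink:.4f} s/step, ratio ~{pcie / nvlink:.0f}x")
```

The ratio simply tracks the bandwidth ratio, and it is paid on every training step. Tensor parallelism exchanges activations inside every layer, so it is even more sensitive to the link than this data-parallel estimate suggests.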
Rent vs buy: the break-even math
A new RTX 4090 is about $1,600. At a dependable cloud on-demand rate of ~$0.69/hr, the card pays for itself after roughly 2,300 hours of use - around 96 full days of 24/7 runtime, or far longer if you only use it on weekends.
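Written out, the break-even point is the purchase price divided by the hourly rate you would otherwise pay:

```latex
t_{\text{break-even}} = \frac{P_{\text{card}}}{r_{\text{cloud}}}
  = \frac{\$1{,}600}{\$0.69/\text{hr}} \approx 2{,}319\ \text{hr}
  \approx \frac{2{,}319\ \text{hr}}{24\ \text{hr/day}} \approx 96.6\ \text{days}
```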
That number is a starting point, not the whole story. Adjust honestly for:
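- Electricity. A 4090 under load plus the rest of the machine can pull 600W+. At $0.20/kWh that is ~$0.12/hr you keep paying when you own it, which cloud pricing already bakes in. Subtracting it from the $0.69/hr you would otherwise pay pushes break-even out to roughly 2,800 hours.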
- Resale value. Owned hardware retains resale value; cloud spend is gone. A 4090 holds its value reasonably well, which shifts the math toward buying for heavy users.
- Availability and convenience. Owning means the card is always there. Renting means dealing with occasional capacity limits, but zero maintenance, instant access to a 5090 or 3090 when you want to compare, and no upfront cash.
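Folding those adjustments into the break-even formula takes only a few lines. This is a minimal sketch, assuming owning costs the card price minus expected resale up front plus electricity per hour, while renting costs the cloud rate with power included. `break_even_hours` and `weeks_to_break_even` are hypothetical helpers, and the $800 resale figure in the usage block is an illustrative assumption, not a market quote.

```python
def break_even_hours(card_price: float, cloud_rate: float,
                     power_kw: float = 0.0, kwh_price: float = 0.0,
                     resale_value: float = 0.0) -> float:
    """Hours of use after which owning costs less than renting.

    Owning: (card_price - resale_value) up front, plus power_kw * kwh_price per hour.
    Renting: cloud_rate per hour, with electricity already included.
    """
    own_hourly = power_kw * kwh_price
    saving_per_hour = cloud_rate - own_hourly
    if saving_per_hour <= 0:
        return float("inf")  # electricity alone costs as much as renting
    return (card_price - resale_value) / saving_per_hour


def weeks_to_break_even(hours: float, hours_per_week: float) -> float:
    """Calendar weeks needed to accumulate `hours` at a steady weekly usage."""
    return hours / hours_per_week


if __name__ == "__main__":
    plain = break_even_hours(1600, 0.69)
    with_power = break_even_hours(1600, 0.69, power_kw=0.6, kwh_price=0.20)
    with_resale = break_even_hours(1600, 0.69, power_kw=0.6, kwh_price=0.20, resale_value=800)
    for label, h in [("price only", plain), ("+ electricity", with_power),
                     ("+ $800 resale", with_resale)]:
        print(f"{label}: {h:,.0f} hr, {weeks_to_break_even(h, 40):.0f} weeks at 40 hr/week")
```

Electricity pushes break-even later and resale pulls it earlier, which is why the honest answer depends on how many hours a week you will really keep the card busy.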
Rule of thumb: if you will genuinely use a GPU for most of the week, every week, for over a year, buying wins. For everything else - experiments, occasional fine-tunes, bursty inference - renting is cheaper and far less hassle.
When to step up to a datacenter card
Move off consumer cards when:
- Your model needs more than 32 GB VRAM - 70B+ models, even quantized, or long-context serving with a large KV cache.
- You need multi-GPU scaling that actually scales - NVLink-equipped H100s or A100s for tensor-parallel inference and real training.
- You are serving production traffic where reliability and sustained throughput matter more than the lowest hourly rate.
For inference specifically, our inference GPU picks walk through the tradeoffs between consumer cards, L40S/L4, and H100-class hardware depending on model size and traffic.
How we get this data
The prices above come from our database, which covers 28 providers and is refreshed daily from public sources. Compare current pricing for each card directly: RTX 4090, RTX 5090, and RTX 3090.