If you'd otherwise drop $2,000 on a 4090 to run Llama or Qwen at home, renting a consumer card by the hour is often the smarter move: no upfront capex, no power bill, and you can size up to a 5090's 32 GB or down to a 3090's 24 GB per job. The three axes below cover the fastest consumer card, the best day-to-day workhorse, and the absolute cheapest 24 GB you can rent. These are GeForce-class cards — great for single-GPU inference and light fine-tuning, not for multi-GPU training.
Live pricing and 7-day reliability scores update on every page load. The curated picks are refreshed manually.
32 GB GDDR7 and Blackwell-gen throughput make the 5090 the fastest consumer card you can rent. The extra 8 GB over a 4090 lets a 32B model at 4-bit (~20 GB) run with room left for context, or gives you real KV-cache headroom on a 14B at longer context. Per-second billing and free egress on RunPod. Best single-GPU local-LLM box short of datacenter silicon.
The default workhorse. 24 GB runs 7B-13B comfortably and a 32B at aggressive quantization. Pod (long-lived) and serverless both available, GraphQL API with documented lifecycle hooks, per-second billing, free egress. Less marketplace chaos than a pure spot market — the boring choice that just runs your Ollama / vLLM box.
Same 24 GB of VRAM as a 4090 at a fraction of its marketplace floor price, which makes it the cheapest way to run a quantized 7B-13B for hobby projects. Lower memory bandwidth and no FP8 mean tokens/sec lag a 4090, but for a chat-with-your-docs box that tolerates preemption it's the cost floor. Check the reliability score before committing.
Hyperscalers don't rent GeForce cards (NVIDIA's GeForce driver license prohibits datacenter deployment), so you end up on a datacenter L4 (24 GB) or T4 (16 GB) at hyperscaler markup plus $0.12/GB egress. For a single-user local-LLM box that's all downside: you pay enterprise rates for the same or less VRAM than a neocloud rents on a 24 GB 3090 for cents.
VRAM is the binding constraint, not FLOPS. A 7B model at 4-bit needs ~5 GB, a 13B ~9 GB, a 32B ~20 GB, a 70B ~40 GB — so a 70B won't fit a single 24 GB card without heavy offload that tanks tokens/sec. Pick the card by the model size you actually run, and remember consumer cards have no NVLink: stacking two 4090s gives you 48 GB but PCIe-only interconnect, so multi-GPU is fine for pipeline-parallel inference and painful for anything tensor-parallel.
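The sizing rule above can be sketched as a quick back-of-envelope check. This is a rough estimate, not a measurement: the 4.5 bits per weight (typical of 4-bit formats once scales are included) and the 1 GB fixed overhead are assumptions chosen to land near the figures quoted above, and the function names are hypothetical. Real usage also grows with context length, which this sketch ignores.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,   # assumption: effective size of a "4-bit" quant
                     overhead_gb: float = 1.0) -> float:  # assumption: runtime + small KV cache
    """Rough VRAM needed to hold a quantized model, in GB."""
    # 1e9 params * (bits / 8) bytes per param = params_billion * bits / 8 GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, card_gb: float) -> bool:
    """True if the estimate fits within the card's VRAM."""
    return estimate_vram_gb(params_billion) <= card_gb


if __name__ == "__main__":
    for size in (7, 13, 32, 70):
        need = estimate_vram_gb(size)
        print(f"{size}B: ~{need:.1f} GB | 24 GB card: {fits(size, 24)} | 32 GB card: {fits(size, 32)}")
```

Under these assumptions a 32B model squeezes onto a 24 GB card with only a few GB to spare, while a 70B exceeds both 24 GB and 32 GB. Raise `overhead_gb` if you plan to run long contexts.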
Honesty about gaps beats false confidence. We add data as it becomes structurally available.