Inference economics differ from training: latency, cold-start, billing granularity, and per-token cost all matter more than peak FLOPS. The right GPU for an 8B model is rarely the right GPU for a 405B model. Here's how the three axes split.
Live pricing and 7-day reliability stats update on every page load. Curation is refreshed manually.
192 GB of HBM3 per GPU fits Llama-3.1-405B or DeepSeek-V3 in a single 8-GPU box, without spilling to a second node. ROCm has caught up for vLLM and SGLang in 2025. Bare-metal, with 100% availability in our 7-day coverage window.
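A rough sanity check on the single-node claim, as a sketch: weight memory alone, assuming native checkpoint precision (BF16 for Llama-3.1-405B, FP8 for DeepSeek-V3) and a flat bytes-per-parameter estimate; KV cache and activations still need headroom on top of this.

```python
# Back-of-envelope: do the weights alone fit in one 8-GPU node?
# Assumes native checkpoint precision (BF16 = 2 bytes/param for Llama-3.1-405B,
# FP8 = 1 byte/param for DeepSeek-V3) and ignores KV cache / activations.
NODE_HBM_GB = 8 * 192  # 1,536 GB per 8-GPU box

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * bytes_per_param

for name, params_b, bpp in [("Llama-3.1-405B", 405, 2.0), ("DeepSeek-V3", 671, 1.0)]:
    gb = weight_gb(params_b, bpp)
    verdict = "fits in one node" if gb < NODE_HBM_GB else "needs multi-node"
    print(f"{name}: ~{gb:.0f} GB weights vs {NODE_HBM_GB} GB node HBM -> {verdict}")
```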
Pod (long-lived) and serverless (per-second billing, sub-30s cold start) both available. GraphQL API with documented lifecycle hooks. Spot pricing exposed for cost control on bursty workloads. Free egress.
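Whether per-second billing pays off depends on utilization. A minimal sketch with placeholder rates (the $/hr and $/s figures below are assumptions, not quoted prices): below some busy-hours threshold serverless wins, above it a long-lived pod does.

```python
# Placeholder rates for illustration only -- plug in live prices from the listing.
POD_HOURLY = 2.50            # $/hr for a long-lived pod (assumed)
SERVERLESS_PER_SEC = 0.0011  # $/s billed only while a request runs (assumed)

def monthly_pod() -> float:
    return POD_HOURLY * 24 * 30

def monthly_serverless(busy_seconds_per_day: float) -> float:
    return SERVERLESS_PER_SEC * busy_seconds_per_day * 30

for busy_hours in (1, 4, 8, 16):
    pod, sls = monthly_pod(), monthly_serverless(busy_hours * 3600)
    winner = "serverless" if sls < pod else "pod"
    print(f"{busy_hours:>2} busy h/day: pod ${pod:,.0f}/mo  serverless ${sls:,.0f}/mo  -> {winner}")
```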
Most teams pick H100 by reflex and overpay 4× for inference on ≤13B models. L40S handles 7-13B at int8/fp8 for a fraction of the cost. Free egress. We don't yet show tokens/sec benchmarks — verify on your model before committing.
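Why 48 GB is enough: a back-of-envelope fit check assuming a Llama-2-13B-class dense transformer (40 layers, 40 KV heads, head dim 128) quantized to int8. The shape figures are illustrative, not a benchmark.

```python
# Rough fit check for a 13B-class dense transformer on a 48 GB L40S at int8.
# Shape assumptions (illustrative): 40 layers, 40 KV heads, head dim 128, 1 byte/elem.
L40S_GB = 48

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 1) -> float:
    # 2x for keys and values
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem / 1e9

weights_gb = 13.0  # ~13B params at ~1 byte/param (int8)
kv_gb = kv_cache_gb(tokens=32_000, layers=40, kv_heads=40, head_dim=128)
print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.1f} GB for 32k cached tokens "
      f"= {weights_gb + kv_gb:.1f} GB of {L40S_GB} GB")
```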
Marketplace listings (200+ configs) are great for batch and dev work, but the underlying machine can vanish without notice. Reliability scores only become meaningful after 48 h of coverage, and that's by design. Don't use Vast for production inference SLAs.
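Why 48 h matters, as an illustration only (this is not our scoring formula): with a short probe window, a single brief outage swings an availability number wildly; more samples damp it.

```python
# Illustration only: how window length changes the impact of one 2-hour outage.
PROBES_PER_HOUR = 12  # assume a probe every 5 minutes

def availability(samples: list[bool]) -> float:
    return sum(samples) / len(samples)

outage = [False] * (PROBES_PER_HOUR * 2)  # a single 2-hour outage
for window_hours in (6, 48):
    up = [True] * (PROBES_PER_HOUR * window_hours - len(outage))
    print(f"{window_hours:>2}h window: {availability(up + outage):.1%} available")
```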
Egress is often the dominant inference cost. A 100 RPS API serving 200 KB responses pushes ~50 TB/month of egress. That's $4,500/mo on AWS or GCP at $0.087-0.12/GB — and $0 on a free-egress neocloud. Often more than the GPU bill itself. Compare on per-month TCO including egress, not on $/hr alone.
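The same arithmetic as a reusable sketch. The GPU $/hr rates here are placeholders; only the egress math (100 RPS, 200 KB responses, ~$0.09/GB) comes from the paragraph above.

```python
# Fold egress into monthly TCO. GPU $/hr rates are placeholders; only the
# egress math (100 RPS * 200 KB responses, $0.09/GB) comes from the text.
def egress_gb_per_month(rps: float, resp_kb: float) -> float:
    return rps * resp_kb * 1e3 * 86_400 * 30 / 1e9  # bytes/month -> GB

def monthly_tco(gpu_hourly: float, egress_per_gb: float, rps: float, resp_kb: float) -> float:
    return gpu_hourly * 24 * 30 + egress_gb_per_month(rps, resp_kb) * egress_per_gb

gb = egress_gb_per_month(rps=100, resp_kb=200)
print(f"egress: ~{gb / 1000:.0f} TB/month")
print(f"hyperscaler, $2.00/hr GPU + $0.09/GB egress: ${monthly_tco(2.00, 0.09, 100, 200):,.0f}/mo")
print(f"free-egress neocloud, $3.00/hr GPU:          ${monthly_tco(3.00, 0.00, 100, 200):,.0f}/mo")
```

Note how the nominally cheaper hyperscaler GPU loses once egress is counted, which is the point of comparing per-month TCO rather than $/hr.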
Honesty about gaps beats false confidence. We add data as it becomes structurally available.