Training picks split three ways. The cost-optimal provider isn't always the performance-optimal one — multi-node training is interconnect-bound, and free egress matters more than the headline $/hr when you're moving terabytes of checkpoints. Here's how three lenses (performance, ops, cost) actually shake out, with live pricing.
Live pricing + 7-day reliability update on every page load. Curation refreshed manually.
141 GB HBM3e per GPU fits 70B+ models with less sharding than H100. Per-region Capacity Advisor confirms multi-node availability before you submit. Free egress on checkpoints. Disclosure: gpufinder.dev's creator works at Nebius; the recommendation stands on the H200's specs and the API quality, not the affiliation.
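To make the sharding claim concrete, here's a back-of-envelope memory check. The 70B/bf16 figures are illustrative, and this counts weights only; real training adds optimizer state, gradients, and activations on top:

```python
import math

# Weight memory for a 70B-parameter model in bf16 (2 bytes/param),
# versus per-GPU HBM on H200 (141 GB) and H100 (80 GB).
# Weights only -- optimizer state, gradients, activations add more.
PARAMS = 70e9
weights_gb = PARAMS * 2 / 1e9  # 140 GB

# Minimum GPUs needed just to hold the weights:
min_h200_gpus = math.ceil(weights_gb / 141)
min_h100_gpus = math.ceil(weights_gb / 80)

print(min_h200_gpus, min_h100_gpus)  # 1 2
```

Fewer weight shards means less cross-GPU traffic per step, which is where the headline HBM number turns into a real multi-node advantage.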
Cleanest API in the industry. 1-Click Clusters expose InfiniBand topology you can plan against. 100% availability coverage in our last 7-day window. Free egress. The boring choice that runs.
Roughly one-third the $/hr of AWS p5 for the same 8×H100 workload. Per-second billing, free egress. Caveat: without an InfiniBand fabric, all-reduce on RunPod runs 1.5–3× slower than AWS EFA. Fine for single-node; be careful with multi-node FSDP.
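A rough way to price that caveat in. All numbers here are hypothetical, and `comm_fraction` is something you'd have to measure on your own workload:

```python
def effective_rate(rate_per_hr: float, comm_fraction: float, slowdown: float) -> float:
    """Amdahl-style adjustment: only the communication-bound share of a
    training step slows down, so the run takes (1 - f) + f * slowdown
    times as long, and each unit of work costs that much more."""
    step_multiplier = (1 - comm_fraction) + comm_fraction * slowdown
    return rate_per_hr * step_multiplier

# Illustrative, not live quotes: a $20/hr 8xH100 node whose steps are
# 30% all-reduce, on a 3x slower fabric, effectively costs about $32/hr
# of work done -- still well under a hypothetical $60/hr node with a
# full-speed fabric.
print(effective_rate(20.0, 0.30, 3.0))
```

The point of the sketch: the interconnect penalty inflates effective cost, but it doesn't automatically erase a 3× headline-price gap.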
Hyperscaler markup + EBS attach times + ICE errors mid-week + $0.087/GB egress. Spot churn is severe: AWS H100:8 swung $1.52→$0.74 within two weeks recently. A 1-yr reserved commit makes AWS competitive past ~500 hours/month, but we don't track commit pricing yet; click through to the AWS page for that math.
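The egress line item is easy to underestimate. A quick sketch at the listed $0.087/GB, with made-up checkpoint size and count:

```python
EGRESS_PER_GB = 0.087  # AWS internet egress, $/GB, as quoted above

def egress_cost(checkpoint_gb: float, n_checkpoints: int) -> float:
    """Dollars to pull training checkpoints off the provider."""
    return checkpoint_gb * n_checkpoints * EGRESS_PER_GB

# Pulling ten 140 GB checkpoints (e.g. a 70B bf16 run) costs ~$122 --
# a line item that is simply zero on the free-egress providers above.
print(round(egress_cost(140, 10), 2))
```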
Per-GPU $/hr lies on multi-node training runs. Without high-bandwidth interconnect (NVLink plus InfiniBand 400G or NVSwitch), distributed FSDP can run 2–3× slower, so the effective cost per training step is far higher than the price comparison suggests. We don't yet surface fabric topology; verify on the provider's own docs before committing to a multi-node run.
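To see how fast the fabric gap compounds, here's the standard ring all-reduce cost model. The bandwidth figures are rough public link speeds, not our measurements:

```python
def allreduce_seconds(grad_bytes: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    """Ring all-reduce moves ~2*(N-1)/N bytes per gradient byte per GPU,
    so time is inversely proportional to per-link bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / bw_bytes_per_s

grad_bytes = 70e9 * 2   # 70B bf16 gradients
ib_400g = 400e9 / 8     # 400 Gbit/s InfiniBand -> 50 GB/s
eth_100g = 100e9 / 8    # 100 Gbit/s Ethernet   -> 12.5 GB/s

ratio = allreduce_seconds(grad_bytes, 16, eth_100g) / allreduce_seconds(grad_bytes, 16, ib_400g)
print(round(ratio, 1))  # 4.0: communication alone is 4x slower on the slower fabric
```

How much of that 4× shows up in wall-clock depends on what fraction of each step is communication, which is exactly what a per-GPU $/hr number can't tell you.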
Honesty about gaps beats false confidence. We add data as it becomes structurally available.