The H200 is not a faster H100. It is the same Hopper chip with a bigger fuel tank.
Same 1,979 TFLOPS FP16. Same NVLink 4.0. Same 700W TDP. NVIDIA swapped the memory subsystem: 80 GB HBM3 to 141 GB HBM3e, with bandwidth up 43%. The question is whether that memory upgrade justifies the price premium - and the answer depends entirely on your workload.
## What is actually different

The H200 uses the same GH100 die as the H100 - same 132 streaming multiprocessors, same 4th-gen Tensor Cores, same compute capability. The only change is the memory:
| Spec | H100 SXM5 | H200 SXM |
|---|---|---|
| Architecture | Hopper (GH100) | Hopper (GH100) - same die |
| Memory | 80 GB HBM3 | 141 GB HBM3e |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| FP16 Tensor Core | 1,979 TFLOPS (with sparsity) | 1,979 TFLOPS - identical |
| NVLink | 4.0 (900 GB/s) | 4.0 (900 GB/s) - identical |
| TDP | 700W | 700W - identical |
| Release | 2023 | 2024 |
HBM3e raises the per-pin data rate (up to 9.8 Gbps, versus 6.4 Gbps for HBM3), and the H200 enables a sixth memory stack. The result: 76% more capacity and 43% more bandwidth, with zero change to compute throughput.
Both use the SXM5 socket and are mechanically compatible with the same NVLink Switch systems. The H200 is a drop-in replacement in DGX H100 baseplates. Same CUDA compute capability (9.0), same driver stack - zero code changes required to migrate.
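If you want to verify the drop-in claim on a live instance, here is a minimal sketch with PyTorch (assuming a CUDA build and device 0) that reads the only properties that actually differ:

```python
import torch

# Compute capability reads 9.0 on both cards; only the name and
# memory size should change after an H100 -> H200 swap.
props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")
print(f"CC:   {props.major}.{props.minor}")        # 9.0 on H100 and H200
print(f"VRAM: {props.total_memory / 1e9:.0f} GB")  # ~80 GB vs ~141 GB
```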
## Cloud pricing comparison
Here is what H100 and H200 cost per hour across the providers we track, pulled live from our database:
| Provider | H100 /hr | H200 /hr | Difference |
|---|---|---|---|
| Hyperbolic | $1.29 | - | - |
| Vast | $1.80 (in stock) | $2.35 (limited) | +31% |
| Cudo | $1.82 (in stock) | - | - |
| Seeweb | $1.89 | $2.60 | +38% |
| PrimeIntellect | $1.90 (in stock) | $2.00 (in stock) | +5% |
| Hyperstack | $1.90 (in stock) | - | - |
| Shadeform | $1.90 (in stock) | $3.44 (in stock) | +81% |
| Verda | $2.29 | $3.39 | +48% |
| Theta EdgeCloud | $2.29 (in stock) | $2.29 (in stock) | 0% |
| RunPod | $2.39 (limited) | $3.39 (limited) | +42% |
| Scaleway | $2.52 (in stock) | - | - |
| Lambda | $2.86 | - | - |
| Nebius | $2.95 | $3.50 | +19% |
| Google Cloud | $4.96 | - | - |
| Digital Ocean | $6.74 (in stock) | - | - |
| AWS | $6.88 | - | - |
| Azure | $6.98 | - | - |
| Yotta | - | $2.10 | - |
The H100 starts at around $1.29/hr on budget providers, and the H200 floor sits around $2.00-2.15/hr - roughly a 55-65% premium. The gap narrows at higher-tier providers, where both cards land in the $3-4/hr range.
H100 spot pricing is widely available, often dipping below $1/hr. H200 spot exists but is thin and unreliable - do not build a pipeline that depends on it.
## When the H200 is worth the premium

### Large model inference (70B+ parameters)
This is where the H200 earns its price. Llama 3 70B in FP16 uses roughly 140 GB of VRAM - it fits on a single H200 but needs two H100s. The math:
- 1x H200 at median $2.80/hr
- 2x H100 at median $2.45/hr each = $4.90/hr total
The H200 is 43% cheaper and eliminates tensor parallelism overhead entirely. For production inference serving, this also means half the instances to manage, half the egress surface, and no cross-GPU communication latency. See our egress fees comparison for the hidden cost that stacks up here.
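The fit-and-cost math above reduces to a few lines. A back-of-the-envelope sketch - the rates are the medians quoted in this article, and `gpus_needed` is an illustrative helper, not a capacity planner:

```python
import math

def gpus_needed(weights_gb: float, vram_gb: float) -> int:
    """Minimum GPUs whose combined VRAM holds the weights.

    Note: this leaves nothing for KV cache; production serving needs
    headroom (quantize, or add a card).
    """
    return math.ceil(weights_gb / vram_gb)

weights_gb = 70e9 * 2 / 1e9  # Llama 3 70B in FP16: ~140 GB of weights

for name, vram_gb, usd_hr in [("H100", 80, 2.45), ("H200", 141, 2.80)]:
    n = gpus_needed(weights_gb, vram_gb)
    print(f"{name}: {n}x @ ${usd_hr:.2f}/hr -> ${n * usd_hr:.2f}/hr")
# H100: 2x @ $2.45/hr -> $4.90/hr
# H200: 1x @ $2.80/hr -> $2.80/hr
```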
### Long-context serving (128K+ tokens)

KV cache for long-context models is where the H200 pulls furthest ahead. At a 128K context length, a 70B model can push 60-80 GB of KV cache on top of the model weights. On the H100's 80 GB, that cache is constantly evicted or offloaded; the H200 holds the full context in memory.
The throughput difference is not 43% - it is often 3-5x because you eliminate KV cache offloading entirely. For document QA, RAG with large retrievals, or multi-turn agents running long sessions, this is the deciding factor.
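The KV-cache arithmetic is easy to run yourself. A sketch, assuming a grouped-query-attention configuration similar to Llama 3 70B (80 layers, 8 KV heads, head dim 128 - illustrative values, not a serving plan); the footprint scales linearly with batch size, and models without GQA need roughly 8x more:

```python
def kv_cache_gb(seq_len: int, batch: int = 1, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128,
                dtype_bytes: int = 2) -> float:
    """FP16 KV cache: one K and one V tensor per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * seq_len * batch / 1e9

print(f"{kv_cache_gb(128 * 1024):.0f} GB")           # ~43 GB for one 128K sequence
print(f"{kv_cache_gb(128 * 1024, batch=2):.0f} GB")  # ~86 GB, past H100 capacity alone
```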
### Mixtral and large MoE models

Mixtral 8x7B in FP16 uses roughly 90 GB. One H200 handles it; one H100 cannot. Mistral Large (123B) does not fit either card unquantized - at INT8 it needs two H100s or a single H200. If you are running mixture-of-experts architectures, the extra VRAM eliminates the need for model sharding.
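Quantization changes the fit, and the arithmetic is simple, as the sketch below shows. Parameter counts are rough (Mixtral 8x7B has about 47B total parameters - every expert stays resident in VRAM even though only a subset is active per token):

```python
MODELS_B = {"Mixtral 8x7B": 47, "Mistral Large": 123}

# Weight footprint at common precisions; KV cache and activations come on top.
for name, params_b in MODELS_B.items():
    for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
        gb = params_b * bytes_per_param
        fit = "1x H100" if gb <= 80 else ("1x H200" if gb <= 141 else "sharded")
        print(f"{name:13s} {label}: {gb:6.1f} GB -> {fit}")
```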
## When the H100 is the better choice

### Models under 70B parameters
Llama 3 8B, Mistral 7B, CodeLlama 13B at INT8, Stable Diffusion XL - all run identically on H100 and H200. At $1.29/hr versus $2.14/hr minimum, you are paying a 66% premium for memory you will never touch.
For inference serving of sub-70B models, the H100 is the right call.
### Compute-bound training
Same TFLOPS means same training throughput. Distributed training on 70B+ models is NVLink-bound and network-bound, not memory-bandwidth-bound per card. The 43% memory bandwidth uplift helps data loading and optimizer steps marginally - expect 5-10% end-to-end speedup at best for multi-node training. At current pricing, the ROI is not there for training clusters. Spend the budget on more H100 nodes instead.
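One way to see why the bandwidth uplift barely moves training: compare each card's machine balance (peak FLOPs per byte of bandwidth) against the arithmetic intensity of large training GEMMs, which typically runs well into the hundreds of FLOPs per byte. A sketch using the spec-sheet numbers above (1,979 TFLOPS is the sparse FP16 figure; dense is about half, which shifts both balances down together without changing the comparison):

```python
# Machine balance = peak FLOPs / memory bandwidth. Kernels whose
# arithmetic intensity exceeds this are compute-bound, and extra
# bandwidth does not speed them up.
SPECS = {"H100": (1979e12, 3.35e12), "H200": (1979e12, 4.8e12)}

for name, (flops, bandwidth) in SPECS.items():
    print(f"{name}: {flops / bandwidth:.0f} FLOPs/byte to saturate compute")
# H100: 591 FLOPs/byte, H200: 412 FLOPs/byte. Big training GEMMs clear
# both thresholds, so same TFLOPS means same training throughput.
```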
### Spot pricing advantage

H100 spot is liquid and mature - 20+ providers list on-demand capacity from $1.29/hr, and spot often dips below $1/hr. H200 spot supply is thin. For batch workloads and preemptible inference jobs where interruption is tolerable, H100 spot is hard to beat on unit economics.
## Cost breakdown by scenario

Monthly figures assume a 720-hour month:

| Scenario | H100 cost | H200 cost | Winner |
|---|---|---|---|
| 7B model training, 100 hours | $129 (1x) | $214 (1x) | H100 saves 40% |
| 70B inference, 24/7, monthly | $3,528 (2x) | $2,016 (1x) | H200 saves 43% |
| Fine-tuning 70B, 10 hours | $49 (2x) | $28 (1x) | H200 saves 43% |
| 8B inference serving, monthly | $929 (1x) | $2,016 (1x) | H100 saves 54% |
At scale the numbers become significant. Running ten 70B inference instances saves roughly $15,000 per month by choosing H200 over 2x H100 configurations.
Rule of thumb: if your model fits comfortably in 80 GB, default to H100. If it needs 120 GB or more, the H200 pays for itself within the first month of continuous use.
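That rule of thumb compresses into a few lines. A sketch - the thresholds and prices are this article's numbers, not universal constants:

```python
def pick_gpu(model_vram_gb: float, interruptible: bool = False) -> str:
    """Crude chooser using the thresholds from this comparison."""
    if model_vram_gb <= 80:
        return "H100 spot (<$1/hr)" if interruptible else "H100 (from $1.29/hr)"
    if model_vram_gb <= 141:
        return "1x H200 (beats 2x H100 on cost)"
    return "multi-GPU: price out 2x H100 vs 2x H200 for your workload"

print(pick_gpu(16))   # 8B-class model -> H100
print(pick_gpu(140))  # 70B FP16 model -> 1x H200
```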
## Availability and migration
The H100 is the safer bet for immediate capacity - 20+ providers, mature spot market, available in every major cloud region. The H200 is expanding but still concentrated at providers with direct NVIDIA partnerships or newer rack deployments.
Migration from H100 to H200 requires zero software changes. Same CUDA compute capability (9.0), same driver stack, same NVLink topology. The only migration work is re-benchmarking memory-bound workloads and adjusting batch sizes upward to exploit the extra 61 GB.
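Batch size is the one knob worth retuning. A hedged sketch with PyTorch - `torch.cuda.mem_get_info` is a real call, but the linear scaling heuristic is an assumption, so validate with a benchmark before shipping:

```python
import torch

def scaled_batch_size(h100_batch: int, h100_free_gb: float = 70.0) -> int:
    """Scale a known-good H100 batch size by the VRAM actually free here."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return max(1, int(h100_batch * (free_bytes / 1e9) / h100_free_gb))

# A batch of 32 tuned on an H100 (~70 GB usable) becomes ~57 on an H200
# reporting ~125 GB free. Memory scaling is linear; throughput is not.
print(scaled_batch_size(32))
```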
Both are 700W SXM - same rack density, same cooling contracts, same PDU math. If you have qualified a facility for H100, the H200 drops in without renegotiating power.
## Verdict
- Model fits in 80 GB? H100. Save 40-60%.
- Need 120 GB+ VRAM? H200. It pays for itself.
- Budget-constrained? H100 on-demand from $1.29/hr, spot below $1/hr.
- Production inference, 70B+? H200. Fewer instances, lower total cost.
- Distributed training? H100. Same compute throughput, better economics.
The H200 is not a generational leap - it is a targeted memory upgrade for workloads that were memory-constrained on the H100. If that describes your workload, the premium is justified. If not, the H100 remains the best value in cloud GPUs.
Compare live pricing for both: H100 cloud pricing | H200 cloud pricing | How we collect data