NVIDIA: Blackwell Ultra Cuts Inference Token Costs by Up to 35x vs. Hopper

Jim Carroll

4 months ago

NVIDIA’s Blackwell Ultra architecture is delivering major gains in inference economics for agentic AI workloads, according to new InferenceX data published by SemiAnalysis. In a February 16 NVIDIA blog post, author Ashraf Eassa reported that NVIDIA’s GB300 NVL72 systems provide up to 50x higher throughput per megawatt and as much as 35x lower cost per token versus the prior-generation Hopper platform. The findings focus on low-latency and long-context workloads such as AI coding agents and interactive assistants, which now account for roughly half of AI software programming queries, up from 11% last year, according to OpenRouter’s State of Inference report.

The SemiAnalysis data attributes the gains to a combination of Blackwell Ultra silicon advances and ongoing software stack optimization across TensorRT-LLM, Dynamo, Mooncake and SGLang. GB300 NVL72 integrates Blackwell Ultra GPUs with NVLink Symmetric Memory and optimized GPU kernels designed to minimize idle cycles through programmatic dependent launch. In low-latency inference scenarios, including multi-step agentic coding workflows, GB300 NVL72 delivers up to 35x lower cost per million tokens compared to Hopper. For long-context workloads—such as 128,000-token inputs with 8,000-token outputs—GB300 achieves up to 1.5x lower cost per token than GB200 NVL72, reflecting improvements in NVFP4 compute performance and faster attention processing.

Cloud providers are deploying the platform at scale. Microsoft, CoreWeave and Oracle Cloud Infrastructure are rolling out GB300 NVL72 systems for production inference targeting coding assistants and other agentic AI applications. SemiAnalysis reports that the improvements extend the momentum already seen with Blackwell deployments among inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI, which cited up to 10x reductions in cost per token with earlier Blackwell systems.

• SemiAnalysis InferenceX data shows up to 50x higher throughput per megawatt for GB300 NVL72 vs. Hopper

• Up to 35x lower cost per million tokens for low-latency agentic AI workloads

• 1.5x lower cost per token vs. GB200 NVL72 for 128K-token long-context use cases

• Software optimizations in TensorRT-LLM and Dynamo deliver up to 5x performance gains on GB200 over four months

• NVFP4 compute improves 1.5x and attention processing doubles vs. prior generation

• GB300 NVL72 deployed by Microsoft, CoreWeave and Oracle Cloud Infrastructure for production inference
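
The "Nx lower" and "Nx higher" multipliers above can be easier to compare when restated as percentage reductions. The following is a minimal sketch of that arithmetic only; the function names are illustrative and not from the source, and because the reported figures are "up to" values, the results are best-case bounds rather than typical savings. It also assumes that 50x higher throughput per megawatt corresponds to 50x lower energy per token at the measured system power.

```python
def fraction_of_baseline(factor: float) -> float:
    """For an 'Nx lower' claim, return the new value as a fraction of the baseline."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / factor


def percent_reduction(factor: float) -> float:
    """Convert an 'Nx lower' multiplier into a percentage reduction."""
    return (1.0 - fraction_of_baseline(factor)) * 100.0


# Multipliers as reported in the article (all are "up to" figures).
claims = {
    "Cost per token, GB300 NVL72 vs. Hopper": 35.0,
    "Cost per token, GB300 vs. GB200 NVL72 (long context)": 1.5,
    "Cost per token, earlier Blackwell systems (inference providers)": 10.0,
    # Assumption: 50x throughput per megawatt implies 50x lower energy per token.
    "Energy per token, GB300 NVL72 vs. Hopper": 50.0,
}

for label, factor in claims.items():
    print(f"{label}: up to {percent_reduction(factor):.1f}% lower")
```

Read this way, the step from 10x to 35x moves the best-case cost reduction from 90% to roughly 97%, while the 1.5x long-context gain over GB200 NVL72 is about a one-third reduction.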

“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave. “Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave’s AI cloud, including CKS and SUNK, is designed to translate GB300 systems’ gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale.”

https://blogs.nvidia.com/blog/data-blackwell-ultra-performance-lower-cost-agentic-ai

🌐 Analysis: The SemiAnalysis data reinforces the shift in hyperscaler CapEx toward inference-optimized infrastructure as agentic AI and coding workloads expand. NVIDIA’s roadmap—from Hopper to Blackwell Ultra and the forthcoming Rubin architecture—positions throughput-per-megawatt and token economics as primary competitive metrics, an area where rivals including AMD and custom silicon efforts from hyperscalers are also intensifying focus.
