AWS Launches Trainium3 UltraServers to Scale Frontier AI

Amazon Web Services rolled out its new EC2 Trn3 UltraServers, Trainium3-powered systems built for large-scale AI training and high-throughput inference. The new 3nm Trainium3 chip delivers up to 4.4x more compute performance, 4x higher energy efficiency, and nearly 4x more memory bandwidth than Trainium2, allowing customers to train larger models, reduce inference latency, and cut operational costs. Each UltraServer integrates up to 144 Trainium3 chips, reaching up to 362 FP8 PFLOPs per system, and uses AWS-engineered networking to eliminate scale-out bottlenecks.

Trn3 UltraServers introduce enhanced NeuronSwitch-v1 bandwidth and sub-10-microsecond Neuron Fabric networking to support next-generation AI workloads including agentic systems, mixture-of-experts, and reinforcement learning. EC2 UltraClusters 3.0 can now interconnect thousands of UltraServers—scaling to deployments of one million Trainium chips—to support trillion-token training datasets and massive real-time inference fleets. AWS reports that customers such as Anthropic, Karakuri, Metagenomi, NetoAI, Ricoh, and Splash Music are reducing training and inference costs by up to 50% using Trainium, while Decart is achieving 4x faster real-time generative video at half the cost of GPUs.

Amazon Bedrock is already serving production workloads on Trainium3, and AWS confirmed that Trainium4 is in development. The next-generation chip will target at least 6x the processing performance (FP4), 3x the FP8 performance, and 4x the memory bandwidth. It will also support NVIDIA NVLink Fusion, allowing Trainium4, Graviton, and EFA to share common MGX rack architectures for flexible, mixed GPU–ASIC clusters.

• Trainium3 delivers up to 4.4x more compute performance, 4x greater energy efficiency, and nearly 4x more memory bandwidth than Trainium2.

• Each Trn3 UltraServer integrates up to 144 Trainium3 chips for up to 362 FP8 PFLOPs.

• NeuronSwitch-v1 provides 2x the internal bandwidth, with sub-10-microsecond Neuron Fabric chip-to-chip latency.

• EC2 UltraClusters 3.0 scale to 1 million Trainium chips—10x the previous generation.

• Customers report up to 50% lower costs; Decart reports 4x faster real-time generative video at half the cost of GPUs.

• Trainium4 will add NVLink Fusion support for mixed Trainium/GPU workloads inside MGX racks.
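
The headline figures above can be cross-checked with some back-of-the-envelope arithmetic. The Python sketch below is illustrative only: it assumes a fully populated 144-chip UltraServer, treats the 362 FP8 PFLOPs figure as a peak spread evenly across chips, and assumes a one-million-chip cluster built entirely from full UltraServers. None of these derived numbers are published by AWS, and the function names are hypothetical.

```python
import math

# Figures quoted in the article.
CHIPS_PER_ULTRASERVER = 144      # maximum Trainium3 chips per Trn3 UltraServer
ULTRASERVER_FP8_PFLOPS = 362     # FP8 PFLOPs per fully populated UltraServer
CLUSTER_CHIPS = 1_000_000        # stated EC2 UltraClusters 3.0 scale
SCALE_VS_PREVIOUS = 10           # "10x the previous generation"


def per_chip_pflops(system_pflops: float, chips: int) -> float:
    """Implied FP8 PFLOPs per chip, assuming compute is split evenly."""
    return system_pflops / chips


def ultraservers_needed(total_chips: int, chips_per_server: int) -> int:
    """UltraServers required to host total_chips, assuming every
    server is fully populated (rounded up to a whole server)."""
    return math.ceil(total_chips / chips_per_server)


per_chip = per_chip_pflops(ULTRASERVER_FP8_PFLOPS, CHIPS_PER_ULTRASERVER)
servers = ultraservers_needed(CLUSTER_CHIPS, CHIPS_PER_ULTRASERVER)
previous_scale = CLUSTER_CHIPS // SCALE_VS_PREVIOUS

print(f"Implied FP8 PFLOPs per chip: {per_chip:.2f}")
print(f"UltraServers for {CLUSTER_CHIPS:,} chips: {servers:,}")
print(f"Implied previous-generation cluster scale: {previous_scale:,} chips")
```

Under these assumptions, the figures work out to roughly 2.5 FP8 PFLOPs per chip and about 6,945 UltraServers for a million-chip cluster, which is consistent with the article's description of "thousands of UltraServers."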

🌐 Analysis: Hyperscalers are increasingly relying on custom silicon to control performance, cost, and power efficiency at AI-factory scale. AWS continues to invest in Trainium and Graviton as strategic alternatives to GPU-only architectures, following similar moves from Google (TPU v5p), Microsoft (Maia), and Meta (MTIA). Trainium3 and the upcoming Trainium4 show AWS tightening its integration with high-speed fabrics—now including NVLink Fusion—to support mixed GPU–ASIC clusters and reduce dependency on external supply chains for frontier-model infrastructure.

Jim Carroll

Editor & Publisher

Every article published by Converge Digest is researched, curated, fact-checked and editorially reviewed by Jim Carroll, Editor & Publisher. AI-assisted drafting may be used to accelerate production, but all content is reviewed, refined and approved prior to publication.