At AWS re:Invent, Amazon Web Services (AWS) announced the general availability of Trainium2-powered EC2 Trn2 instances and introduced the Trn2 UltraServers, designed for high-performance AI model training and inference. These offerings deliver 30-40% better price performance compared to GPU-based instances. Trn2 instances integrate 16 Trainium2 chips and achieve up to 20.8 petaflops of peak compute, making them ideal for training and deploying large language models (LLMs) and foundation models (FMs). AWS also unveiled Trainium3, its next-generation AI chip, promising a significant leap in performance for model development and real-time inference.
The Trn2 UltraServers go a step further, combining four Trn2 servers into one unified system using the ultra-fast NeuronLink interconnect. This architecture scales up compute power to 83.2 peak petaflops, quadrupling the compute, memory, and networking capabilities of a single instance. AWS is collaborating with Anthropic, an AI safety and research company, to build Project Rainier—an EC2 UltraCluster that will harness hundreds of thousands of Trainium2 chips. This cluster aims to train and deploy cutting-edge AI models at a scale unprecedented in the industry.
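The scaling described above can be sanity-checked with a few lines of arithmetic. The per-chip figure below is derived from the stated instance totals, not quoted by AWS:

```python
# Figures stated in the announcement.
TRN2_CHIPS = 16            # Trainium2 chips per Trn2 instance
TRN2_PETAFLOPS = 20.8      # peak petaflops per Trn2 instance
ULTRASERVER_SERVERS = 4    # Trn2 servers joined via NeuronLink

# Derived values (illustrative; not quoted directly by AWS).
per_chip_petaflops = TRN2_PETAFLOPS / TRN2_CHIPS                 # ~1.3 petaflops per chip
ultraserver_chips = TRN2_CHIPS * ULTRASERVER_SERVERS             # 64 chips
ultraserver_petaflops = TRN2_PETAFLOPS * ULTRASERVER_SERVERS     # 83.2 peak petaflops

print(per_chip_petaflops, ultraserver_chips, ultraserver_petaflops)
```

The derived numbers line up with the announcement: 64 chips per UltraServer and 83.2 peak petaflops, exactly four times a single Trn2 instance.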
The AWS Neuron SDK, which optimizes AI workloads for Trainium chips, integrates with popular machine learning frameworks such as PyTorch and JAX. Early adopters, including Anthropic, Databricks, and Hugging Face, are leveraging Trainium2 instances to accelerate model training, reduce costs, and enhance inference capabilities. Trn2 instances are now available in the US East (Ohio) AWS Region, with additional regions to follow. Trn2 UltraServers are currently available in preview.
Key Highlights:

• Trn2 Instances:
  • 16 Trainium2 chips per instance.
  • Up to 20.8 petaflops of peak compute performance.
  • 30-40% better price performance than GPU-based EC2 instances.
  • Optimized for training and deploying AI models with billions of parameters.
• Trn2 UltraServers:
  • Four Trn2 servers combined into a unified system.
  • 64 Trainium2 chips interconnected with NeuronLink for low-latency communication.
  • 83.2 peak petaflops of compute, enabling the training of trillion-parameter models.
• Trainium3 Chip:
  • Built on a 3nm process node for higher performance and efficiency.
  • Expected to deliver 4x the performance of Trn2 UltraServers.
  • Availability projected for late 2025.
• Project Rainier:
  • A collaboration with Anthropic to create one of the largest AI compute clusters ever built.
  • Hundreds of thousands of Trainium2 chips interconnected with petabit-scale networking.
  • More than 5x the exaflops used in Anthropic's previous training efforts.
• AWS Neuron SDK:
  • Optimizes AI workloads for Trainium chips.
  • Compatible with PyTorch, JAX, and over 100,000 Hugging Face models.
  • Offers low-code integration for efficient deployment.
• Adopters and Use Cases:
  • Anthropic: scaling its flagship Claude LLM on Trainium2 to enhance AI safety and reliability.
  • Databricks: leveraging Trn2 instances for Mosaic AI to deliver cost-efficient, scalable model training.
  • Hugging Face: enabling faster model development through AWS Trainium-powered infrastructure.
  • Poolside: planning to train and deploy AI systems with significant cost savings using Trn2 UltraServers.
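Taking the highlight figures at face value, the forward-looking claims imply rough orders of magnitude. The 4x multiplier is AWS's stated expectation for Trainium3; the resulting totals below, and the 100,000-chip count used for Project Rainier, are illustrative assumptions rather than announced specs:

```python
TRN2_ULTRASERVER_PETAFLOPS = 83.2  # stated peak for a 64-chip Trn2 UltraServer
TRAINIUM3_MULTIPLIER = 4           # AWS's expected gain over Trn2 UltraServers

# Derived projection for a Trainium3 UltraServer (illustrative only).
projected_petaflops = TRN2_ULTRASERVER_PETAFLOPS * TRAINIUM3_MULTIPLIER

# Ballpark for Project Rainier, assuming a conservative 100,000 chips
# (the announcement says only "hundreds of thousands") at the derived
# ~1.3 petaflops per Trainium2 chip.
rainier_chips = 100_000
rainier_exaflops = rainier_chips * (20.8 / 16) / 1000  # petaflops -> exaflops

print(projected_petaflops, rainier_exaflops)
```

Even at the low end of "hundreds of thousands," this works out to on the order of a hundred exaflops of peak compute, which puts the "more than 5x" comparison with prior Anthropic training runs in context.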