Converge Digest

NVIDIA Expands Spectrum-X Ethernet With MRC

NVIDIA outlined how its Spectrum-X Ethernet platform and the new Multipath Reliable Connection (MRC) transport protocol are emerging as critical building blocks for the next generation of AI factories. The company said OpenAI, Microsoft, and Oracle are already deploying MRC across some of the world’s largest AI training environments, including Microsoft’s Fairwater supercomputers and Oracle Cloud Infrastructure’s Abilene deployment in Texas. MRC enables a single RDMA connection to distribute traffic dynamically across multiple network paths, improving throughput, resiliency, and load balancing for large-scale AI clusters.

MRC addresses one of the central challenges in AI infrastructure: maintaining high GPU utilization across increasingly massive training systems. Instead of relying on a single communication path, MRC continuously steers traffic across available network routes in real time to avoid congestion and failures. NVIDIA said the protocol supports intelligent retransmission, microsecond-scale failure bypass, and hardware-accelerated rerouting across Spectrum-X Ethernet fabrics. The architecture is designed to minimize GPU idle time during frontier model training runs, where even brief network interruptions can stall thousands of synchronized accelerators.

NVIDIA also highlighted the growing importance of multiplanar network architectures for AI clusters scaling to hundreds of thousands of GPUs. In these designs, multiple independent network fabrics operate simultaneously to improve resiliency and bandwidth efficiency. Spectrum-X Multiplane capabilities provide hardware-assisted load balancing across these fabrics while maintaining predictable low latency. NVIDIA said Spectrum-X Ethernet supports multiple RDMA transport models, including Adaptive RDMA, MRC, and custom protocols running across NVIDIA ConnectX SuperNICs and Spectrum-X Ethernet switches. The company added that MRC has now been released as an open specification through the Open Compute Project with contributions from AMD, Broadcom, Intel, Microsoft, and OpenAI.

“Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA,” said Sachin Katti. “MRC’s end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale.”
