NVIDIA Expands Spectrum-X Ethernet With MRC

Jim Carroll

1 month ago

NVIDIA outlined how its Spectrum-X Ethernet platform and the new Multipath Reliable Connection (MRC) transport protocol are emerging as critical building blocks for the next generation of AI factories. The company said OpenAI, Microsoft, and Oracle are already deploying MRC across some of the world’s largest AI training environments, including Microsoft’s Fairwater supercomputers and Oracle Cloud Infrastructure’s Abilene deployment in Texas. MRC enables a single RDMA connection to distribute traffic dynamically across multiple network paths, improving throughput, resiliency, and load balancing for large-scale AI clusters.

MRC addresses one of the central challenges in AI infrastructure: maintaining high GPU utilization across increasingly massive training systems. Instead of relying on a single communication path, MRC continuously steers traffic across available network routes to avoid congestion and failures in real time. NVIDIA said the protocol supports intelligent retransmission, microsecond-scale failure bypass, and hardware-accelerated rerouting across Spectrum-X Ethernet fabrics. The architecture is designed to minimize GPU idle time during frontier model training runs, where even brief network interruptions can impact thousands of synchronized accelerators.
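The path-steering idea can be illustrated with a toy model. The sketch below is not taken from the MRC specification: the `Path` and `MultipathConnection` classes, the queue-depth congestion signal, and the least-loaded selection rule are all hypothetical simplifications. It only shows how one logical connection might spread packets over several routes and skip a route once it is marked as failed.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One hypothetical network route between two endpoints."""
    name: str
    queue_depth: int = 0   # toy congestion signal: packets sent on this path
    healthy: bool = True   # set to False once a failure is detected


@dataclass
class MultipathConnection:
    """Toy model of a single logical connection spread over several paths."""
    paths: list[Path]
    sent: dict[str, int] = field(default_factory=dict)

    def pick_path(self) -> Path:
        # Consider only healthy paths, then take the least-loaded one.
        candidates = [p for p in self.paths if p.healthy]
        if not candidates:
            raise RuntimeError("no healthy path available")
        return min(candidates, key=lambda p: p.queue_depth)

    def send(self, n_packets: int) -> None:
        # Choose a path per packet, so load is rebalanced continuously.
        for _ in range(n_packets):
            path = self.pick_path()
            path.queue_depth += 1
            self.sent[path.name] = self.sent.get(path.name, 0) + 1

    def fail(self, name: str) -> None:
        # Mark a path as failed; later packets bypass it.
        for p in self.paths:
            if p.name == name:
                p.healthy = False


conn = MultipathConnection([Path(f"path{i}") for i in range(4)])
conn.send(8)        # traffic spreads evenly over four paths
conn.fail("path1")  # simulate a link failure
conn.send(6)        # remaining traffic uses the three healthy paths
print(conn.sent)
```

In real hardware the selection would be driven by telemetry and performed in the NIC and switches rather than in software; the per-packet choice is the only point this sketch is meant to convey.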

NVIDIA also highlighted the growing importance of multiplanar network architectures for AI clusters scaling to hundreds of thousands of GPUs. In these designs, multiple independent network fabrics operate simultaneously to improve resiliency and bandwidth efficiency. Spectrum-X Multiplane capabilities provide hardware-assisted load balancing across these fabrics while maintaining predictable low latency. NVIDIA said Spectrum-X Ethernet supports multiple RDMA transport models, including Adaptive RDMA, MRC, and custom protocols running across NVIDIA ConnectX SuperNICs and Spectrum-X Ethernet switches. The company added that MRC has now been released as an open specification through the Open Compute Project with contributions from AMD, Broadcom, Intel, Microsoft, and OpenAI.
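As a rough illustration of load balancing across independent planes, the sketch below divides one transfer among planes in proportion to their available bandwidth. The function name `split_across_planes`, the plane names, and the bandwidth figures are assumptions made for the example; this is not how Spectrum-X hardware is documented to behave.

```python
def split_across_planes(total_bytes: int, plane_gbps: dict[str, float]) -> dict[str, int]:
    """Divide a transfer across planes in proportion to each plane's bandwidth.

    A plane with bandwidth 0 is treated as down and receives no traffic.
    """
    usable = {name: bw for name, bw in plane_gbps.items() if bw > 0}
    if not usable:
        raise ValueError("no usable plane")
    total_bw = sum(usable.values())
    shares = {name: int(total_bytes * bw / total_bw) for name, bw in usable.items()}
    # Give any rounding remainder to the first plane so the shares sum exactly.
    first = next(iter(shares))
    shares[first] += total_bytes - sum(shares.values())
    return shares


# Four equal planes (hypothetical figures), then the same transfer with one plane down.
print(split_across_planes(1000, {"a": 400, "b": 400, "c": 400, "d": 400}))
print(split_across_planes(1000, {"a": 400, "b": 400, "c": 400, "d": 0}))
```

When one plane drops out, the remaining planes absorb its share, which is the resiliency property the multiplanar design aims for.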

MRC allows a single RDMA connection to utilize multiple network paths simultaneously.
OpenAI deployed MRC during the NVIDIA Blackwell generation to improve training efficiency at scale.
Microsoft Fairwater and OCI Abilene rely on MRC in their frontier AI model training infrastructure.
Spectrum-X Ethernet provides hardware-based failure bypass with rerouting measured in microseconds.
MRC dynamically avoids congested network paths during active training jobs.
NVIDIA released the MRC specification through the Open Compute Project.
Contributors to MRC development include AMD, Broadcom, Intel, Microsoft, OpenAI, and NVIDIA.
Spectrum-X Ethernet supports multiplanar network designs for scaling to hundreds of thousands of GPUs.

“Deploying MRC in the Blackwell generation was very successful and was made possible by a strong collaboration with NVIDIA,” said Sachin Katti. “MRC’s end-to-end approach enabled us to avoid much of the typical network-related slowdowns and interruptions and maintain the efficiency of frontier training runs at scale.”