OCP Releases ESUN 1.0 Spec for Ethernet-Based Scale-Up AI Fabrics

The Open Compute Project (OCP) released the Ethernet for Scale-Up Networking (ESUN) 1.0 specification, establishing a new architecture for using Ethernet as the fabric inside large AI compute domains. The document, titled “OCP ESUN – Network Operator Requirements – Base Specification 1.0,” defines enhancements designed to support low-latency communication across large GPU clusters used for distributed training, inference, and emerging agentic AI workloads.

The ESUN initiative launched only four months ago at the OCP Global Summit 2025, but participation has quickly expanded to more than 175 companies, up from the 12 founding members: AMD, Arista, Arm, Broadcom, Cisco, HPE Networking, Marvell, Meta, Microsoft, NVIDIA, OpenAI, and Oracle. Meta and Microsoft led the development of the initial specification with input from a wide ecosystem of silicon vendors, system vendors, and hyperscale operators.

The ESUN specification addresses a central challenge emerging in AI infrastructure: the need for high-performance scale-up networks that connect hundreds or even thousands of GPUs in a tightly coupled domain. These networks support model-parallel workloads that require extremely low-latency, high-bandwidth communication between accelerators. In next-generation AI systems, scale-up domains are expected to expand from traditional 8-GPU node configurations to clusters exceeding 1,000 GPUs, often spanning multiple racks while maintaining the performance characteristics of a single shared compute domain.

To support these requirements, ESUN introduces several architectural changes to standard Ethernet. One key innovation is the ESUN Header (EH), a compact 4-byte header that replaces the traditional IP/UDP header stack. By removing the IP header (20 bytes for IPv4 without options, 40 bytes for IPv6) and the 8-byte UDP header, and substituting a streamlined header designed for scale-up environments, ESUN improves packet efficiency, particularly for the small messages common in GPU synchronization traffic.
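The efficiency gain can be estimated with simple arithmetic. The sketch below is illustrative only: it uses the standard Ethernet, IPv4, IPv6, and UDP header sizes and the 4-byte EH size stated above, and it assumes an untagged frame while ignoring preamble, inter-frame gap, minimum-frame padding, and any transport header carried above the EH. The function name `efficiency` is made up for this example.

```python
# Rough payload-efficiency comparison for small messages.
# Assumptions: untagged Ethernet II frame; preamble, inter-frame gap,
# padding, and any transport header above the EH are not counted.
ETH_HEADER = 14    # destination MAC + source MAC + EtherType
ETH_FCS = 4        # frame check sequence
IPV4_HEADER = 20   # without options
IPV6_HEADER = 40
UDP_HEADER = 8
ESUN_HEADER = 4    # size stated in the article


def efficiency(payload: int, encap_overhead: int) -> float:
    """Return the fraction of frame bytes that carry payload."""
    total = ETH_HEADER + encap_overhead + payload + ETH_FCS
    return payload / total


if __name__ == "__main__":
    for payload in (64, 256, 1024):
        ipv4_udp = efficiency(payload, IPV4_HEADER + UDP_HEADER)
        ipv6_udp = efficiency(payload, IPV6_HEADER + UDP_HEADER)
        esun = efficiency(payload, ESUN_HEADER)
        print(f"{payload:5d} B  IPv4/UDP {ipv4_udp:.1%}  "
              f"IPv6/UDP {ipv6_udp:.1%}  EH {esun:.1%}")
```

Under these assumptions, a 64-byte payload occupies roughly 58% of an IPv4/UDP frame and roughly 74% of a frame carrying the 4-byte EH, and the gap narrows as payloads grow.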

The ESUN header also reintroduces essential networking functions normally carried in IP fields. These include congestion feedback (EH-ECN), traffic differentiation (EH-CoS), flow labeling for load balancing, and a TTL field to prevent routing loops. Together, these mechanisms allow Ethernet switches to forward packets using Layer-2 addressing while preserving critical congestion control and traffic management capabilities required for large-scale GPU fabrics.
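The article does not give the bit layout of the EH, so the following sketch assumes hypothetical field widths (2-bit ECN, 4-bit CoS, 6-bit TTL, 20-bit flow label) purely to show how four such fields could share 4 bytes. The class name `EsunHeader`, the widths, and the field ordering are illustrative and are not taken from the specification.

```python
import struct
from dataclasses import dataclass

# Hypothetical field widths; the real EH layout is defined in the OCP spec.
ECN_BITS, COS_BITS, TTL_BITS, FLOW_BITS = 2, 4, 6, 20  # sums to 32 bits


@dataclass(frozen=True)
class EsunHeader:
    ecn: int         # congestion feedback
    cos: int         # traffic class
    ttl: int         # hop limit to stop forwarding loops
    flow_label: int  # entropy value for load balancing

    def pack(self) -> bytes:
        """Pack the four fields into a 4-byte big-endian word."""
        fields = (("ecn", self.ecn, ECN_BITS), ("cos", self.cos, COS_BITS),
                  ("ttl", self.ttl, TTL_BITS),
                  ("flow_label", self.flow_label, FLOW_BITS))
        for name, value, bits in fields:
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{name} does not fit in {bits} bits")
        word = ((self.ecn << 30) | (self.cos << 26)
                | (self.ttl << 20) | self.flow_label)
        return struct.pack("!I", word)

    @classmethod
    def unpack(cls, data: bytes) -> "EsunHeader":
        """Recover the fields from the first 4 bytes of data."""
        (word,) = struct.unpack("!I", data[:4])
        return cls(ecn=word >> 30,
                   cos=(word >> 26) & 0xF,
                   ttl=(word >> 20) & 0x3F,
                   flow_label=word & 0xFFFFF)


if __name__ == "__main__":
    header = EsunHeader(ecn=1, cos=3, ttl=8, flow_label=0xABCDE)
    raw = header.pack()
    print(raw.hex(), EsunHeader.unpack(raw))
```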

The specification also defines mechanisms to support lossless Ethernet behavior, which is essential for high-performance GPU interconnects. ESUN builds on existing Ethernet flow control features such as Priority Flow Control (PFC) and adds support for Credit-Based Flow Control (CBFC) and Link Level Retry (LLR) technologies developed within the Ultra Ethernet Consortium. These mechanisms help maintain reliability and minimize packet loss even as link speeds and cluster sizes increase.
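The general idea behind credit-based flow control can be shown with a toy model: the sender transmits only while it holds credits, and the receiver returns a credit each time it frees a buffer, so the receive queue never overflows. This is a conceptual sketch, not the CBFC wire protocol from the Ultra Ethernet Consortium; the class `CreditedLink` and its methods are hypothetical.

```python
from collections import deque


class CreditedLink:
    """Toy model of one link direction with credit-based flow control."""

    def __init__(self, receiver_buffers: int) -> None:
        # The sender starts with one credit per receiver buffer.
        self.credits = receiver_buffers
        self.rx_queue = deque()

    def try_send(self, frame: bytes) -> bool:
        """Send if a credit is available; otherwise stall instead of dropping."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self, count: int = 1) -> int:
        """Consume up to `count` frames and return that many credits."""
        drained = 0
        while self.rx_queue and drained < count:
            self.rx_queue.popleft()
            drained += 1
        self.credits += drained
        return drained


if __name__ == "__main__":
    link = CreditedLink(receiver_buffers=2)
    print([link.try_send(b"x") for _ in range(3)])  # third send stalls
    link.receiver_drain(1)
    print(link.try_send(b"x"))  # a returned credit allows one more frame
```

Because the sender stalls rather than transmitting into a full buffer, loss from buffer overrun is avoided by construction; recovery from corrupted frames is a separate concern, which is where a link-level retry mechanism comes in.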

Another focus of ESUN is enabling multi-hop hierarchical topologies within scale-up domains. As switch radix limits and GPU lane counts constrain direct connectivity, AI infrastructure increasingly relies on multi-tier network designs. ESUN introduces new load balancing and congestion management mechanisms to ensure consistent latency and high throughput across these larger topologies.
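One common way to spread traffic across the parallel paths of a multi-tier topology is to hash a per-flow value and use the result to choose an uplink, which keeps each flow on a single path while distributing different flows. The sketch below applies that generic technique to a flow label; it is an assumption about how such a label might be used, not the load-balancing algorithm defined by ESUN, and `pick_uplink` is a hypothetical helper.

```python
import zlib
from typing import List


def pick_uplink(flow_label: int, uplinks: List[str]) -> str:
    """Choose an uplink deterministically from a flow label.

    The same label always maps to the same uplink, so frames of one
    flow stay in order; different labels spread across the uplinks.
    """
    if not uplinks:
        raise ValueError("no uplinks available")
    digest = zlib.crc32(flow_label.to_bytes(4, "big"))
    return uplinks[digest % len(uplinks)]


if __name__ == "__main__":
    spines = ["spine0", "spine1", "spine2", "spine3"]
    for label in (0x00001, 0x00002, 0xABCDE):
        print(hex(label), "->", pick_uplink(label, spines))
```

A static hash like this can still place several heavy flows on the same uplink, which is one reason large fabrics pair load balancing with congestion feedback.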

The specification also outlines requirements for both endpoints and switches. Endpoints must support ESUN header generation and respond to congestion feedback signals, while switches must support deterministic forwarding using Ethernet destination addresses, congestion signaling, and load balancing using the ESUN flow label. Security can be provided at Layer 2 using MACsec, since the removal of IP headers makes traditional IPsec mechanisms incompatible with ESUN traffic.
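The switch-side behavior described here can be pictured as a per-hop routine: look up the egress port by destination MAC address, decrement the TTL and drop the frame if it has expired, and set a congestion mark when the egress queue is deep. The sketch below is a simplified illustration under assumed semantics; the `Frame` type, the `forward` function, the queue-depth threshold, and the mark value (borrowed from the IP ECN "congestion experienced" codepoint) are all hypothetical rather than taken from the specification.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

ECN_CONGESTED = 0b11  # assumption: reuse IP ECN's "congestion experienced" value


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    ttl: int
    ecn: int = 0


def forward(frame: Frame,
            fdb: Dict[str, int],
            queue_depth: Dict[int, int],
            mark_threshold: int) -> Optional[Tuple[int, Frame]]:
    """Return (egress_port, updated_frame), or None if the frame is dropped."""
    port = fdb.get(frame.dst_mac)
    if port is None:
        return None  # unknown destination: drop rather than flood
    if frame.ttl <= 1:
        return None  # TTL would reach zero: drop to break forwarding loops
    out = replace(frame, ttl=frame.ttl - 1)
    if queue_depth.get(port, 0) > mark_threshold:
        out = replace(out, ecn=ECN_CONGESTED)  # signal congestion downstream
    return port, out


if __name__ == "__main__":
    fdb = {"02:00:00:00:00:01": 7}
    depths = {7: 10}
    print(forward(Frame("02:00:00:00:00:01", ttl=4), fdb, depths, mark_threshold=5))
```

An endpoint that receives a marked frame would then be expected to relay that feedback to the sender so it can reduce its rate; how that is done is outside the scope of this sketch.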

With the release of version 1.0, OCP positions ESUN as a baseline architecture for deploying Ethernet-based scale-up fabrics inside AI clusters. The specification emphasizes openness and interoperability, encouraging silicon vendors and system suppliers to develop ESUN-compliant hardware platforms that can be deployed across hyperscale environments.

• OCP released ESUN 1.0, defining requirements for Ethernet-based scale-up AI fabrics

• ESUN introduces a 4-byte ESUN Header to replace the traditional IP/UDP header stack and improve packet efficiency

• The specification targets GPU clusters exceeding 1,000 accelerators connected across racks

• Key features include Priority Flow Control (PFC), Credit-Based Flow Control (CBFC), and Link Level Retry (LLR)

• The architecture enables multi-hop hierarchical topologies for large AI training domains

• More than 175 companies now participate in the ESUN ecosystem

• The specification was developed with leadership from Meta and Microsoft

“The OCP ESUN initiative is focused on enabling Ethernet to serve as the foundation for high-performance scale-up networks inside modern AI systems,” the OCP Networking Project wrote in its announcement.

The full specification is available here: https://www.opencompute.org/documents/ocp-esun-network-operator-requirements-base-specification-rev-1-0-final-pdf

🌐 Analysis

The release of ESUN 1.0 reflects a broader industry effort to adapt Ethernet for the demanding communication patterns of large AI systems. Hyperscale operators increasingly require scale-up fabrics that can connect thousands of GPUs while maintaining extremely low latency and deterministic performance. Historically, these domains relied on proprietary interconnects such as NVIDIA NVLink, or on InfiniBand-based fabrics.

Industry momentum is now shifting toward open Ethernet-based approaches, including the Ultra Ethernet Consortium and initiatives like ESUN. By defining enhancements such as compact headers, link-level reliability, and congestion control mechanisms tailored for GPU workloads, ESUN aims to make Ethernet a viable alternative for scale-up fabrics inside next-generation AI clusters.


Jim Carroll
