Converge Digest

Optica Executive Forum: Tech Giants Debate Future of Photonics in AI Clusters

April 1, 2025
in Optical
by James E. Carroll

San Francisco, March 31, 2025 — At the Optica Executive Forum during OFC 2025, an all-star panel of infrastructure leaders from Microsoft, NVIDIA, Meta, and Arista Networks convened to tackle one of the biggest bottlenecks in AI computing: the interconnect. With hyperscale data centers scaling up to millions of GPUs and petabytes of data in motion, the panel explored how photonic interconnects can bridge the performance, power, and reliability gaps emerging in next-gen AI clusters.

Moderated by Chris Pfistner of Avicena Tech, the session broke down the multi-layered topology of data center networks—front-end, scale-out, and the elusive scale-up layer—where traditional copper links still dominate due to cost and simplicity. But as Microsoft’s Pradeep Sindhu explained, AI workloads now require interconnecting thousands of GPUs per pod with ever-increasing bandwidth per device. “Copper is simply not going to scale to 256 or 512 GPUs in a pod,” Sindhu warned. “The opportunity for optics lies squarely in the scale-up layer.”

NVIDIA’s Ashkan Seyedi showcased the company’s latest advances in co-packaged optics (CPO), introduced just weeks prior, as a way to tackle power inefficiency and reduce network jitter. The new Spectrum-X platform, which integrates optics directly onto the switch package, was framed as a critical enabler of GPU utilization. “Power is directly translatable to money,” Seyedi noted. “With CPO, we can interconnect 3× the number of GPUs at the same network power budget compared to pluggables.”
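The 3× figure can be read as simple budget arithmetic: at a fixed network power budget, the number of GPUs that can be connected is the budget divided by the network power spent per GPU, so tripling the GPU count implies roughly one third of the per-GPU network power. The sketch below only illustrates that relationship. The function name and every wattage are hypothetical placeholders, not figures from NVIDIA or the panel.

```python
def gpus_within_budget(network_power_budget_w: float, network_power_per_gpu_w: float) -> int:
    """Number of GPUs whose network links fit inside a fixed network power budget."""
    return int(network_power_budget_w // network_power_per_gpu_w)


# Hypothetical placeholder values, chosen only so the arithmetic is easy to follow.
budget_w = 900_000.0                       # assumed total network power budget
pluggable_w_per_gpu = 30.0                 # assumed network power per GPU with pluggables
cpo_w_per_gpu = pluggable_w_per_gpu / 3    # assumed: one third, mirroring the quoted 3x claim

gpus_pluggable = gpus_within_budget(budget_w, pluggable_w_per_gpu)
gpus_cpo = gpus_within_budget(budget_w, cpo_w_per_gpu)
ratio = gpus_cpo / gpus_pluggable

print(gpus_pluggable, gpus_cpo, ratio)
```

Under these assumed inputs the ratio comes out to 3 by construction. The point is only that the quoted claim is equivalent to a statement about per-GPU network power.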

Meta’s Drew Alduino provided a sobering reality check, grounding the conversation in operational experience. Meta is planning to deploy 1.3 million GPUs this year, backed by a $60–$65 billion CapEx investment. “It’s not all optics,” he said, “but it’s not not optics either.” Alduino emphasized that reliability, not just bandwidth, is now the industry’s Achilles’ heel. One failing optical link can cause a cascading stall across an entire AI training job, an issue that becomes more frequent as cluster size grows. “With 100,000 nodes, you’re failing every 20 seconds unless your network becomes bulletproof.”
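The "every 20 seconds" figure can be sanity-checked with a back-of-the-envelope model. The sketch below assumes independent nodes with a constant failure rate, so the mean interval between failures anywhere in the cluster is the per-node interval divided by the node count. That model and the function names are assumptions made for illustration. The panel did not describe how the figure was derived.

```python
SECONDS_PER_DAY = 86_400


def cluster_failure_interval_s(per_node_mtbf_s: float, nodes: int) -> float:
    """Mean time between failures anywhere in the cluster (independent nodes assumed)."""
    return per_node_mtbf_s / nodes


def implied_node_mtbf_s(cluster_interval_s: float, nodes: int) -> float:
    """Per-node mean time between failures implied by a cluster-wide failure interval."""
    return cluster_interval_s * nodes


# Figures quoted in the article: 100,000 nodes, one failure every 20 seconds.
nodes = 100_000
quoted_interval_s = 20.0

node_mtbf_s = implied_node_mtbf_s(quoted_interval_s, nodes)
node_mtbf_days = node_mtbf_s / SECONDS_PER_DAY

# Same per-node reliability at ten times the scale (hypothetical cluster size).
interval_at_1m_nodes_s = cluster_failure_interval_s(node_mtbf_s, 1_000_000)

print(f"implied per-node MTBF: {node_mtbf_days:.1f} days")
print(f"failure interval at 1,000,000 nodes: {interval_at_1m_nodes_s:.1f} s")
```

Under these assumptions, the quoted figure corresponds to each node seeing a job-affecting fault roughly once every 23 days. Keeping that per-node rate while growing the cluster tenfold would shrink the cluster-wide interval to about 2 seconds.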

Arista’s Andy Bechtolsheim championed linear pluggable optics (LPO) as a more practical and serviceable alternative to CPO. “Yes, you get the same power and latency,” he said, “but pluggables offer better serviceability, faster repair cycles, and open multi-vendor compatibility.” He urged the industry to accelerate development of high-density 64-lane pluggable modules, arguing that many of the benefits attributed to CPO can be achieved in a pluggable form factor without the system-level downsides.


Key Takeaways from the Panel + Q&A

• AI cluster growth is exponential: Meta expects 1.3M GPUs online in 2025, with data centers drawing over 2 GW—equivalent to powering San Francisco.

• Photonic interconnects are already used in scale-out and long-haul links, but have yet to penetrate the scale-up GPU-to-GPU domain.

• Microsoft’s view: Optical transceivers are essential for scaling pod sizes beyond 64 GPUs; copper will hit thermal and signal integrity limits.

• NVIDIA’s Spectrum-X CPO platform promises 3× GPU interconnect density at the same network power footprint as traditional pluggables.

• Meta emphasized that reliability—especially soft/transient failures—has become the biggest barrier to scaling AI infrastructure.

• CPO vs LPO: CPO offers tighter integration and lower power; LPO provides superior modularity and easier diagnostics and replacement.

• Total Cost of Ownership (TCO): Panelists agreed that cost-per-link isn’t the only metric; performance-per-TCO across the entire data center is what really matters.

• Shoreline bottlenecks (limited physical IO off GPUs) are being addressed through 3D packaging, chiplet designs, and short-reach electrical channels.

• Optical Circuit Switching (OCS) is not a substitute for packet switching in AI training workloads—OCS is more akin to a dynamic patch panel.

• Serviceability risk: CPO failures require replacing the entire switch chassis, whereas LPO failures can be isolated to a single module, saving hours.

• Failure trends: Most failures are not lasers but components like wire bonds and connectors; better integration can mitigate risks.

• GR-468 not sufficient: Data center scale brings unique reliability and testing needs not covered by telecom-grade standards.

• Future timeline: Copper will dominate GPU-to-GPU interconnects through 2027, but optical scale-up is inevitable as rack densities rise.

• Call to industry: Bechtolsheim urged development of a new open 64-lane pluggable standard to avoid being locked into closed CPO solutions.


Tags: Arista, OFC25, Optica

Jim Carroll

Editor and Publisher, Converge! Network Digest, Optical Networks Daily - Covering the full stack of network convergence from Silicon Valley
