• Home
  • Events Calendar
  • Blueprint Guidelines
  • Privacy Policy
  • Subscribe to Daily Newsletter
  • NextGenInfra.io
Converge Digest
Saturday, May 16, 2026


OIF 448: Google’s AI Challenge – Scaling Networks for 100K+ TPU Clusters

May 18, 2025
in Video

At the recent OIF 448G Workshop in Santa Clara, Tad Hofmeister, Optical Hardware Engineer on Google’s Machine Learning Systems team, offered a deep dive into Google’s evolving AI infrastructure and made a compelling case for accelerating industry-wide support for 448Gbps electrical interfaces. Hofmeister, a long-time OIF contributor now focused on data center interconnects for AI workloads, outlined the demands of hyperscale AI clusters—both Google’s custom TPU-based systems and NVIDIA-based GPU clusters—and their growing reliance on high-speed, high-density connectivity to handle scale-up and scale-out traffic.

Hofmeister emphasized that while power and cost are always factors, the central motivation for 448G is simple: XPUs are running out of I/O escape. As Google’s Ironwood TPUs and NVIDIA’s Grace Blackwell GPUs push the limits of on-chip compute, the need to move more data between devices becomes critical. Hofmeister detailed both Google’s proprietary ICI-based TPU topology—which uses optical interconnects between cube-style clusters—and NVIDIA’s rack-contained NVLink GPU architectures, highlighting how both platforms demand massive bandwidth density and flexibility, with increasing adoption of co-packaged copper (CPC) to overcome signal integrity and density challenges.
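The I/O escape argument can be made concrete with a little arithmetic. The sketch below counts how many SerDes lanes a package would need per direction to reach a given aggregate bandwidth at different per-lane rates. The 51,200 Gb/s target is a hypothetical figure chosen for illustration, not a number from the talk, and the per-lane rates are treated as raw line rates with no coding overhead.

```python
import math


def lanes_needed(aggregate_gbps: float, lane_gbps: float) -> int:
    """Return the number of SerDes lanes (per direction) needed to carry
    aggregate_gbps when each lane runs at lane_gbps."""
    return math.ceil(aggregate_gbps / lane_gbps)


# Hypothetical per-package I/O target, for illustration only.
TARGET_GBPS = 51_200

for lane_rate in (112, 224, 448):
    count = lanes_needed(TARGET_GBPS, lane_rate)
    print(f"{lane_rate}G lanes: {count} per direction")
```

Under these assumptions, doubling the per-lane rate roughly halves the number of lanes that must escape the package, which is the pressure the paragraph above describes.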

He urged standards bodies to prioritize fast decision-making, suggesting the industry choose between PAM6 and PAM8 to avoid delays, and supported new front-panel connector MSAs tailored for 448G, even at the expense of backward compatibility. Hofmeister concluded by warning against designs that cannot be reliably serviced at scale and encouraged the community to adopt solutions that support flexibility, testability, and production viability.
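To see what is at stake in the modulation choice, the following sketch compares the symbol rate each PAM order would need to carry 448 Gb/s on one lane. It is a rough illustration, not a statement of any OIF proposal: it treats 448 Gb/s as the line rate and ignores FEC and coding overhead, and it assumes PAM6 carries 2.5 bits per symbol (5 bits mapped across 2 symbols), a common practical mapping rather than something stated in the talk.

```python
from fractions import Fraction

# Bits carried per symbol. PAM6 at 5/2 is an assumed practical mapping
# (5 bits over 2 symbols), below the theoretical log2(6) of about 2.585.
BITS_PER_SYMBOL = {
    "PAM4": Fraction(2),
    "PAM6": Fraction(5, 2),
    "PAM8": Fraction(3),
}


def symbol_rate_gbd(line_rate_gbps: int, modulation: str) -> Fraction:
    """Return the symbol rate in GBd needed for line_rate_gbps."""
    return Fraction(line_rate_gbps) / BITS_PER_SYMBOL[modulation]


for name in BITS_PER_SYMBOL:
    rate = symbol_rate_gbd(448, name)
    print(f"{name}: {float(rate):.1f} GBd")
```

Higher-order PAM lowers the symbol rate, and with it the channel bandwidth required, at the cost of tighter spacing between signal levels.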

• Google’s TPU-based AI clusters use a proprietary interconnect with optical circuit switching between racks, enabling scale-up to 9,216 TPUs per superpod.
• XPU trays must support both copper and optical interconnects via modular OSFPs for flexible deployment.
• The move to 448G is driven by package I/O limitations, not just performance or power savings.
• Google is skeptical that PAM4 will close at 448G and advocates for PAM6 or PAM8.
• Co-packaged copper is critical to bypass PCB limitations and achieve SerDes targets.
• Front-panel pluggables with improved connectors and possibly 12V power are needed to support up to 50W modules for high-performance optics.
• New connector MSAs should prioritize signal integrity over backward compatibility.
• Reliability, serviceability, and supply chain flexibility must be core design principles.
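The interest in 12V power for 50W modules follows from Ohm's-law arithmetic: for a fixed power, a higher rail voltage means less current through the connector's power pins. The sketch below compares a 12V rail with a 3.3V rail. The 3.3V baseline is an assumption based on common pluggable module supplies, not a figure from the talk, and conversion losses are ignored.

```python
def supply_current_amps(power_w: float, rail_v: float) -> float:
    """Return the current in amps drawn at rail_v to deliver power_w,
    ignoring conversion losses."""
    return power_w / rail_v


MODULE_POWER_W = 50.0  # upper bound cited in the bullet list above

# 3.3 V is an assumed baseline; 12 V is the option mentioned in the talk.
for rail_v in (3.3, 12.0):
    amps = supply_current_amps(MODULE_POWER_W, rail_v)
    print(f"{rail_v:>4} V rail: {amps:.2f} A")
```

Under these assumptions, a 12V rail carries a bit more than a quarter of the current a 3.3V rail would need for the same 50W module.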

Tad Hofmeister, Optical Hardware Engineer, Google:

“448G isn’t just about speed—it’s about survival. We’re hitting the ceiling on how many SerDes we can escape from these XPUs. The path forward requires rethinking connector design, embracing co-packaged copper, and accepting that some legacy constraints must be broken to get where AI needs us to go.”

Want to be involved in our video series? Contact [email protected]
https://ngi.fyi/oif448-google-tad

Tags: 448, Google, OIF

Jim Carroll

Editor and Publisher, Converge! Network Digest, Optical Networks Daily - Covering the full stack of network convergence from Silicon Valley

Converge Digest

A private dossier for networking and telecoms

© 2026 Converge Digest - A private dossier for networking and telecoms.
