San Francisco, March 31, 2025 — In a keynote at the Optica Executive Forum, Amin Vahdat, VP and Fellow at Google, delivered an urgent message for this week’s OFC 2025 conference: the future of AI hinges not just on compute and storage, but on solving the networking bottleneck. “Networking is the number one bottleneck we face,” Vahdat asserted, emphasizing that achieving the next generation of AI breakthroughs will require a radical rethinking of how data moves across systems — from chip-level to global scale.
Vahdat traced the history of computing from the early days of copper links to today’s multi-gigawatt AI infrastructure powered by TPUs and GPUs. He described how AI workloads, particularly model training and serving, now demand ultra-high-bandwidth, low-latency interconnects. To support synchronous workloads running on thousands of TPUs, Google has turned to optical circuit switching, using MEMS-based systems to enable real-time failover and reconfiguration. These switches have become essential, not optional, to maintaining the reliability and scale of modern AI clusters.
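To make the failover idea concrete, the sketch below models an optical circuit switch as a programmable port-to-port mapping and reroutes a circuit onto a spare port when a link fails. This is a minimal illustration, not Google's control software; the class, ports, and topology are hypothetical.

```python
# Minimal sketch: an OCS as a mutable one-to-one cross-connect between ports,
# plus a failover routine that rewires around a failed link.
# Hypothetical model for illustration only.

class OpticalCircuitSwitch:
    """Models an OCS as a programmable one-to-one port mapping."""

    def __init__(self):
        self.cross_connect = {}  # port -> port

    def connect(self, a: int, b: int) -> None:
        """Establish a bidirectional light path between ports a and b."""
        self.cross_connect[a] = b
        self.cross_connect[b] = a

    def disconnect(self, a: int) -> None:
        """Tear down whatever circuit passes through port a."""
        b = self.cross_connect.pop(a, None)
        if b is not None:
            self.cross_connect.pop(b, None)


def fail_over(ocs: OpticalCircuitSwitch, failed_port: int, spare_port: int) -> None:
    """Reroute the circuit that used failed_port onto spare_port."""
    peer = ocs.cross_connect.get(failed_port)
    if peer is None:
        return  # nothing was connected through the failed port
    ocs.disconnect(failed_port)
    ocs.connect(peer, spare_port)  # surviving endpoint now talks via the spare


if __name__ == "__main__":
    ocs = OpticalCircuitSwitch()
    ocs.connect(0, 1)            # e.g., rack A <-> rack B
    fail_over(ocs, failed_port=1, spare_port=7)
    print(ocs.cross_connect)     # {0: 7, 7: 0}: traffic now flows via the spare port
```

A production controller would also have to account for MEMS mirror settling time, optical loss budgets, and coordination with the running training job, none of which is modeled here.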

Key Takeaways
• The Network is the Bottleneck:
• Vahdat called networking the primary limiting factor for AI scaling — more than compute or storage.
• Demand for compute is growing 10× per year, but interconnects aren’t keeping pace.
• Optical Circuit Switching at Google:
• Google’s AI clusters use optical circuit switches (OCS) for interconnecting 144 racks (8,960 TPUs) per pod.
• MEMS-based switches enable real-time reconfiguration after failures, ensuring continued synchronous computation.
• Google reconfigures its OCS multiple times daily to manage failures dynamically.
• Optical losses and transceiver reliability remain critical challenges; optics are the number-one failure point in Google’s infrastructure.
• MEMS and Reliability:
• MEMS switches have proven reliable once qualified, but yield and packaging remain hurdles.
• Google urged the optical industry to drive better reliability standards beyond traditional telecom benchmarks.
• TPU Network Design:
• Google designed a proprietary non-Ethernet protocol for TPU-to-TPU communication to minimize overhead.
• Each TPU has bandwidth equivalent to a mid-sized Ethernet switch to support intensive, synchronous workloads.
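To put the “mid-sized Ethernet switch” comparison in rough numerical terms, the back-of-the-envelope calculation below uses purely hypothetical link counts and speeds rather than figures from the keynote.

```python
# Back-of-the-envelope comparison of per-chip interconnect bandwidth with a
# mid-sized Ethernet switch. All numbers are hypothetical placeholders.

links_per_chip = 6             # e.g., a chip wired into a 3D torus
gbytes_per_sec_per_link = 100  # assumed link speed, GB/s
chip_tbps = links_per_chip * gbytes_per_sec_per_link * 8 / 1000

switch_ports = 48              # a typical mid-sized top-of-rack switch
gbps_per_port = 100
switch_tbps = switch_ports * gbps_per_port / 1000

print(f"per-chip interconnect: {chip_tbps:.1f} Tb/s")   # 4.8 Tb/s
print(f"48 x 100GbE switch:    {switch_tbps:.1f} Tb/s") # 4.8 Tb/s
```

With these placeholder numbers, a single chip's aggregate interconnect bandwidth lands in the same range as a 48-port 100GbE switch, which is the scale of comparison Vahdat was drawing.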
• Liquid Cooling and Power Constraints:
• Liquid cooling isn’t optional anymore; it delivers higher compute-per-watt and supports thermally dense systems.
• Power, not cost, is the main constraint on scaling data centers today, prompting a shift to performance-per-watt as the primary design metric.
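A toy sizing exercise shows why a fixed power envelope makes performance-per-watt the deciding metric; the accelerator figures below are invented for illustration and are not from the keynote.

```python
# Toy sizing exercise: when power (not cost) is the binding constraint, the
# accelerator with better perf-per-watt wins even if each chip is slower.
# All figures are invented for illustration.

SITE_POWER_MW = 100  # fixed power envelope for the site

accelerators = {
    # name: (TFLOP/s per chip, watts per chip)
    "chip_a": (400, 800),  # 0.50 TFLOP/s per watt
    "chip_b": (300, 500),  # 0.60 TFLOP/s per watt
}

for name, (tflops, watts) in accelerators.items():
    chips = SITE_POWER_MW * 1_000_000 // watts   # chips that fit in the power budget
    total_pflops = chips * tflops / 1000
    print(f"{name}: {chips:,} chips -> {total_pflops:,.0f} PFLOP/s "
          f"({tflops / watts:.2f} TFLOP/s per watt)")
```

Under the fixed budget, the chip with lower per-unit performance but better performance-per-watt deploys in greater numbers and delivers more total compute.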
• AI Architecture Trends:
• Moving from general-purpose CPUs to specialized compute (TPUs, GPUs) is key to efficiency gains.
• Google sees microarchitectural optimization, system-level tuning, and algorithmic improvements as the main drivers going forward.
• Latency and Cluster Size:
• Latency is still masked by clever software (a simple overlap pattern is sketched below), but reducing it would unlock more efficient scaling.
• Google is hedging with network designs that could scale to million-node clusters, even if not every element is synchronized at once.
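On the latency-masking point above, the usual software technique is to overlap communication with computation so that network round trips hide behind useful work. The sketch below shows the pattern with asyncio stand-ins for a real transfer and a real compute kernel; the function names and timings are hypothetical.

```python
# Latency hiding: start fetching the data for the *next* step while computing
# on data already in hand, so network latency overlaps with compute.
import asyncio

async def fetch_remote_shard(step: int) -> str:
    """Stand-in for a network transfer (e.g., an RDMA read or a collective)."""
    await asyncio.sleep(0.05)   # simulated network latency
    return f"shard-{step}"

async def compute_on(shard: str) -> None:
    """Stand-in for on-chip compute using data already local to the accelerator."""
    await asyncio.sleep(0.05)   # simulated compute time

async def train(num_steps: int) -> None:
    inflight = asyncio.create_task(fetch_remote_shard(0))    # warm up the pipeline
    for step in range(num_steps):
        shard = await inflight                                # stalls only if comms lag compute
        if step + 1 < num_steps:
            inflight = asyncio.create_task(fetch_remote_shard(step + 1))
        await compute_on(shard)                               # overlaps with the next fetch

if __name__ == "__main__":
    asyncio.run(train(num_steps=4))
```

If raw network latency were lower, less of this overlap machinery and its in-flight buffering would be needed, which is the more efficient scaling the takeaway refers to.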
• Future Outlook:
• AI’s next phase is about delivering real-time insight, not just information.
• Vahdat predicts massive breakthroughs over the next five years, contingent on overcoming networking and reliability constraints.