How we evolved Google’s global and data center networks for the AI era
Over the last 25 years of building Google’s global network, we’ve navigated major architectural eras — from the Internet, to streaming, and the cloud. Today, we are squarely in the midst of a fourth: the AI era. The applications in the AI era are fundamentally different from the consumer and enterprise applications of the previous eras and impose a set of novel and demanding requirements — on compute resources, of course, but also on the network.
Consider the fundamental physical challenge, which is that it is far more difficult to move electrons (electrical power) than it is to move photons (data over fiber). Because the demand for AI compute frequently outpaces the space and power capacities of individual facilities, we strategically locate data centers near sustainable energy sources, or in locations with pathways to add clean energy sources to the local grid. Then, by utilizing the network to distribute AI workloads across campuses, we create a massive-scale, pooled hypercomputing resource that overcomes the power limitations of any single site.
To deliver this, we created an end-to-end, vertically integrated AI technology stack that comprises everything from chips to systems, to platforms and application and agentic ecosystems. This stack includes a portfolio of pre-built agents and applications; our Gemini Enterprise Agent Platform for you to build, scale, govern, and optimize your AI-enabled applications; world-class AI models; as well as our unified data platform. All this is anchored by our AI Hypercomputer, a unified infrastructure that combines purpose-built hardware and open software, and that comes with flexible consumption options. Our network, forged through decades of innovation, is the essential fabric of the AI Hypercomputer.
The network supporting this stack must meet the stringent bandwidth, scale, and performance needs of AI workloads. This applies not only within the campus, where the network must scale up and out, but also across the wide area network (WAN) along with high-bandwidth interconnects, to bring AI training data from its source to AI compute resources.
To address these challenges, we’ve reimagined three key pillars of our network infrastructure: the fabric inside the AI Hypercomputer, the fabric across the AI Hypercomputer, and our global network. Let’s take a closer look at each of these.
The massive scale of today’s AI models, fueled by the explosive growth of foundational AI model parameters, makes AI training very compute- and network-intensive.
This necessitates an exponential increase in required network bandwidth, with strict bounds on delay (e.g., tail latency) to accommodate AI workloads’ peculiar traffic patterns, which are characterized by sensitivity to performance variation and synchronized bursts, i.e., intense, coordinated, millisecond-level traffic spikes. Furthermore, since large-scale training jobs are uniquely vulnerable to failures and performance stragglers, maintaining high reliability and predictable performance is absolutely essential.
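As a rough illustration of why tail latency matters so much for synchronized traffic, the sketch below estimates how often a synchronous step is delayed when every worker must finish before the step can complete. It assumes each worker independently lands in the slow tail with the same probability, which is a simplification; the function name and the 0.1% tail probability are illustrative assumptions, not measured values.

```python
def p_step_hits_tail(workers: int, per_worker_tail_prob: float) -> float:
    """Probability that at least one of `workers` independent transfers
    lands in the slow tail and therefore delays the whole synchronous step."""
    return 1.0 - (1.0 - per_worker_tail_prob) ** workers


# Assumed tail probability: 1 in 1,000 transfers is slow (p99.9).
for n in (10, 1_000, 100_000):
    print(f"{n:>7} workers -> {p_step_hits_tail(n, 0.001):.3f}")
```

Under these assumptions, an event that is rare for one worker becomes the common case for the job once the worker count grows, which is why the bound on tail latency matters more than the average.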
To address the scale, low latency, and high predictability that modern AI workloads require — as well as protection from extreme bursts — we’ve adopted a "campus as a computer" philosophy, decoupling our network into three distinct domains:
a scale-up domain for intra-pod connectivity
a dedicated east-west scale-out accelerator fabric
the Jupiter frontend network for north-south compute and storage access
This decoupled architecture provides three strategic advantages: it allows domains to evolve independently for faster innovation; provides a non-blocking scale-out network with massive training bandwidth; and helps ensure the network can be co-designed in lockstep with new ML accelerators, for superior hardware support.
Recently, we unveiled Virgo Network, our scale-out data center fabric specifically engineered for modern AI. Virgo utilizes high-radix switches and a flat, two-layer non-blocking topology to provide massive bisection bandwidth, while minimizing latency by reducing network tiers. Its multi-planar design, featuring independent control domains for each plane, provides hardware-level resilience and fault isolation. Furthermore, Virgo can expand across multiple data centers, removing physical building limitations and enabling flexible AI compute scaling.
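To make the link between switch radix and a flat, two-layer topology concrete, the sketch below sizes a generic non-blocking two-tier leaf-spine (folded Clos) fabric built from identical switches. This is textbook Clos arithmetic and not a description of Virgo Network's actual design; the function name and the radix values are illustrative assumptions.

```python
def two_tier_capacity(radix: int) -> dict:
    """Size a generic non-blocking two-tier leaf-spine fabric built from
    identical switches with `radix` ports each (hypothetical model)."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down = radix // 2   # leaf ports facing endpoints
    up = radix - down   # leaf ports facing spines (equal split => non-blocking)
    spines = up         # every leaf has one uplink to every spine
    leaves = radix      # every spine port connects to a distinct leaf
    return {
        "leaves": leaves,
        "spines": spines,
        "endpoints": leaves * down,  # equals radix**2 / 2
    }


for r in (64, 128, 256, 512):
    print(r, two_tier_capacity(r))
```

In this model the endpoint count grows with the square of the radix, so doubling the radix quadruples the number of endpoints a two-tier fabric can attach before a third tier (and its added latency) is needed.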
The effectiveness of our network and accelerator codesign is perfectly illustrated by the recently debuted eighth-generation TPUs. Within this architecture, Virgo Network can link 134,000 TPU 8t chips with up to 47 petabits/sec of non-blocking bisection bandwidth in a single fabric. Virgo Network delivers up to 4x the bandwidth per TPU 8t accelerator over the previous generation, and 40% lower unloaded fabric latency for TPU 8t compared to the previous-generation network for TPUs. In this setup, Virgo Network manages the raw accelerator traffic, while Jupiter provides reliable and rapid access to the global WAN and storage. When integrated with Pathways and JAX, this AI Hypercomputer networking engine facilitates near-linear scaling for up to a million TPU 8t chips in a single logical cluster.
Autonomous reliability: protecting workload goodput
Building a resilient megascale fabric represents only part of the challenge. In a cluster of hundreds of thousands of chips, hardware failures are a statistical certainty. A single stalled instance can stop an entire synchronous training job, wasting valuable compute cycles. As such, efficient fault localization is critical.
We engineered Virgo Network with autonomous reliability capabilities to maximize workload efficiency at scale, also known as goodput. Expanding on our existing straggler detection, Virgo Network now also features automated hang detection. The moment a fail-stop event occurs, our specialized agents immediately localize the fault, isolate the faulty instance, and enable you to restore the training job from a checkpoint, getting your training timeline back on track with minimal manual intervention.
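The difference between a straggler and a hang can be sketched in a few lines. The example below is a generic illustration, not Virgo Network's detection logic: it flags a worker as hung when it has not reported within a timeout, and as a straggler when its latest step took much longer than the median. The function name, worker IDs, and both thresholds are hypothetical.

```python
from statistics import median


def classify_workers(last_step_seconds, seconds_since_heartbeat,
                     straggler_factor=1.5, hang_timeout=60.0):
    """Return (stragglers, hung) as sorted lists of worker ids.

    last_step_seconds: {worker_id: duration of its latest training step}
    seconds_since_heartbeat: {worker_id: seconds since it last reported}
    Both thresholds are illustrative assumptions.
    """
    # A worker that has gone silent is treated as hung (fail-stop).
    hung = sorted(w for w, age in seconds_since_heartbeat.items()
                  if age > hang_timeout)
    # Compare the remaining workers against the typical step time.
    live = {w: t for w, t in last_step_seconds.items() if w not in hung}
    if not live:
        return [], hung
    typical = median(live.values())
    stragglers = sorted(w for w, t in live.items()
                        if t > straggler_factor * typical)
    return stragglers, hung


steps = {"w0": 1.00, "w1": 1.02, "w2": 1.90, "w3": 1.01}
beats = {"w0": 0.5, "w1": 0.4, "w2": 0.6, "w3": 300.0}
print(classify_workers(steps, beats))  # (['w2'], ['w3'])
```

Using the median rather than the mean keeps one very slow worker from shifting the baseline it is being compared against.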
To complement these capabilities, we also use high-resolution, sub-millisecond telemetry to identify elusive network microbursts that conventional 30-second monitoring intervals usually miss. These telemetry advancements enable more efficient network operations, better provisioning, and a lower mean time to recovery.
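A small simulation shows why coarse monitoring misses microbursts. The sketch below uses synthetic traffic on a hypothetical 100 Gbps link: a steady 10% load with a single 5 ms burst at line rate. For simplicity it samples in 1 ms slots rather than at sub-millisecond resolution; all names and numbers are assumptions chosen for illustration.

```python
def utilization(samples_bits, window_s, link_bps):
    """Average link utilization over consecutive windows of `window_s` seconds.

    samples_bits: bits sent in each consecutive 1 ms slot.
    """
    slot_s = 0.001
    per_window = round(window_s / slot_s)
    out = []
    for i in range(0, len(samples_bits), per_window):
        chunk = samples_bits[i:i + per_window]
        out.append(sum(chunk) / (len(chunk) * slot_s * link_bps))
    return out


LINK_BPS = 100e9                    # hypothetical 100 Gbps link
slot_capacity = LINK_BPS * 0.001    # bits the link can carry in 1 ms

# 30 seconds of synthetic traffic at 10% load ...
traffic = [0.10 * slot_capacity] * 30_000
# ... with one 5 ms burst at full line rate.
for ms in range(12_000, 12_005):
    traffic[ms] = slot_capacity

coarse = utilization(traffic, 30.0, LINK_BPS)
fine = utilization(traffic, 0.001, LINK_BPS)
print(f"30 s average utilization: {coarse[0]:.4f}")
print(f"1 ms peak utilization:    {max(fine):.4f}")
```

The 30-second average barely moves from the 10% baseline, while the fine-grained view shows the link saturated for the duration of the burst.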
The exponential growth of modern AI workloads requires us to scale and distribute AI workloads across multiple campuses over a WAN. At the same time, traditional networks weren’t built for the high bandwidth and extreme burstiness of AI traffic, and often fail to detect microbursts that can lead to severe performance degradation. We have developed a suite of innovations to optimize WAN performance for cross-site AI deployments, including:
A multi-shard global network that enables horizontal scaling. Our global network sustained a 10X WAN traffic growth from 2020 to 2025.
Tuning the fabric for essential availability, latency, and quality of service (QoS) attributes. Real-time microburst management helps ensure fair bandwidth allocation and infrastructure isolation across our multi-tenant infrastructure.
Multi-shard isolation to ensure each network shard operates with its own control, data, and management planes.
Combined with regional isolation and Protective Reroute, this architecture minimizes failure impact and shortens user-visible outages — delivering the beyond-nines reliability essential for AI workloads.
Providing high-speed, flexible, and cost-effective interconnectivity is also a priority. AI training relies on vast datasets that are often located on-premises or across various clouds. Given the high cost of AI compute, minimizing idle time is essential; for instance, upgrading from a 100 Gbps link to a 3.2 Tbps connection reduces the time to transfer a petabyte of data from 22.2 hours to just 0.7 hours — a 97% reduction in AI compute idle time spent waiting for data. Our AI-native Cloud Interconnect is purpose-built for the high-bandwidth and low-latency needs of AI workloads, featuring an optimized data path with 400 Gbps links that scale in 3.2 Tbps increments to reach petabit-per-second capacity. It also offers traffic differentiation and flexible connection options, including direct fiber peering and colocation facilities. AI-native Cloud Interconnect supports petabit-scale data transfer with reliable, private connectivity necessary for your cross-cloud AI training and serving.
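The transfer-time figures above follow from simple arithmetic, reproduced in the sketch below. It assumes a decimal petabyte (10^15 bytes), a fully utilized link, and no protocol overhead; the function name is illustrative.

```python
def transfer_hours(data_bytes: float, link_bps: float) -> float:
    """Hours needed to move `data_bytes` over a link of `link_bps` bits/sec,
    assuming full utilization and no protocol overhead."""
    return data_bytes * 8 / link_bps / 3600


PETABYTE = 1e15  # decimal petabyte (assumption)

slow = transfer_hours(PETABYTE, 100e9)    # 100 Gbps
fast = transfer_hours(PETABYTE, 3.2e12)   # 3.2 Tbps

print(f"100 Gbps: {slow:.1f} h")
print(f"3.2 Tbps: {fast:.1f} h")
print(f"reduction: {1 - fast / slow:.0%}")
```

Because transfer time is inversely proportional to bandwidth, the 32x faster link cuts the wait by 31/32, or roughly 97%, regardless of dataset size.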
Applications serving AI inference to a global user population or supporting an agentic enterprise are far more demanding than conventional web apps. The need for opportunistic use of expensive AI compute available at distant locations, distributed service dependencies, and the burstiness of the traffic all demand a high-bandwidth network with a global footprint, as well as deep peering with SaaS providers, ISPs, and hyperscalers. To maintain responsiveness and "always-on" availability, applications need low latency and a highly resilient network.
With its connectivity, scale, and resilience, Google’s global network is well-equipped to handle the demands of the age of AI inference. Our network spans more than 10 million kilometers of terrestrial and subsea fiber, connects our 43 cloud regions, and features 200+ edge locations, providing the essential footprint for serving AI inference. Our Premium Tier network delivers the low latency and reliability needed for consistent, high-quality global user experience. By optimizing traffic entry and exit points, the network significantly boosts application performance, with resilience at the core of this "always-on" infrastructure.
As a Google Cloud customer, you get these network innovations built directly into your environment. Google’s network delivers the massive scale, capacity, reliability, and performance essential for your AI workloads.
The AI era demands more than just raw compute; it necessitates a robust network fabric to scale. Our vertically integrated AI technology stack — from silicon to software ecosystems — is powered by the AI Hypercomputer to accelerate your transformation and make AI helpful for everyone. Whether through our megascale fabric, resilient global network for inference, or AI-native Cloud Interconnect, we ensure your AI journey is efficient and reliable. We look forward to building this future with you.