Constructing Prometheus: Backend Aggregation for Gigawatt-Scale AI Clusters

We’re detailing the role of backend aggregation (BAG) in constructing Meta’s gigawatt-scale AI clusters such as Prometheus. BAG interconnects thousands of GPUs across multiple data centers and regions. Our BAG implementation integrates with both Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF) networks. Once completed, Prometheus will span several data centers, interconnect tens of thousands of GPUs, and deliver 1 gigawatt of capacity to power AI experiences across Meta products. BAG provides the high-capacity networking needed to scale and connect this infrastructure, and it is designed to grow with our AI clusters.

**What Is Backend Aggregation?**

BAG is a centralized Ethernet-based super spine network layer that interconnects multiple spine layer fabrics across data centers and regions within large clusters. In Prometheus, the BAG layer acts as the aggregation point between regional networks and Meta’s backbone, supporting immense bandwidth needs with inter-BAG capacities reaching the petabit range.

**How BAG Is Helping Us Build Gigawatt-Scale AI Clusters**

To interconnect tens of thousands of GPUs, we deploy distributed BAG layers regionally.

**How We Interconnect BAG Layers**

BAG layers are distributed across regions to serve subsets of L2 fabrics, adhering to specific constraints. Inter-BAG connectivity utilizes either a planar or spread connection topology. Planar topology connects BAG switches one-to-one, while spread connection topology enhances path diversity and resilience by distributing links.
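The difference between the two topologies can be illustrated with a small sketch. It assumes two BAG layers with the same number of switches, and it models "spread" as each switch linking to a few consecutive peers on the other side; the switch names, the `fanout` parameter, and the wrap-around pattern are illustrative assumptions, not the actual Prometheus wiring.

```python
def planar_links(n):
    """Planar: switch i in BAG A connects only to switch i in BAG B."""
    return [(f"bagA-{i}", f"bagB-{i}") for i in range(n)]


def spread_links(n, fanout):
    """Spread (assumed pattern): switch i in BAG A connects to `fanout`
    consecutive switches in BAG B, wrapping around at the end."""
    return [
        (f"bagA-{i}", f"bagB-{(i + k) % n}")
        for i in range(n)
        for k in range(fanout)
    ]


def peers_after_failure(links, failed_switch):
    """Map each A-side switch to the B-side peers it can still reach
    once `failed_switch` (a B-side switch) is down."""
    peers = {}
    for a, b in links:
        peers.setdefault(a, [])
        if b != failed_switch:
            peers[a].append(b)
    return peers


planar = peers_after_failure(planar_links(4), "bagB-0")
spread = peers_after_failure(spread_links(4, 2), "bagB-0")
print(planar["bagA-0"])  # no remaining peer in the planar case
print(spread["bagA-0"])  # one remaining peer in the spread case
```

In this toy model a single remote switch failure isolates its planar partner, while the spread variant keeps an alternate path at the cost of more links to manage.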

**How a BAG Layer Connects to L2 Fabrics**

We’ve used both Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF) to build L2 networks. DSF L2 zones across data centers connect to the BAG layer via backend edge pods. NSF L2 fabrics connect to BAG planes, with each BAG plane connecting to its matching Spine Training Switches.

Careful management of oversubscription ratios balances scale and performance. Typical oversubscription from L2 to BAG is around 4.5:1, varying based on regional needs.
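As a worked example of the arithmetic, the oversubscription ratio is simply the total downlink capacity facing the L2 fabric divided by the total uplink capacity facing the BAG. The port counts and speeds below are hypothetical values chosen to reproduce a 4.5:1 ratio; they are not Meta's actual configuration.

```python
def oversubscription(downlink_ports, uplink_ports, port_gbps=800):
    """Ratio of total downlink capacity to total uplink capacity.

    All arguments are hypothetical inputs for illustration only.
    """
    downlink_capacity = downlink_ports * port_gbps
    uplink_capacity = uplink_ports * port_gbps
    return downlink_capacity / uplink_capacity


# 36 x 800G toward the fabric vs. 8 x 800G toward the BAG (assumed numbers)
ratio = oversubscription(downlink_ports=36, uplink_ports=8)
print(f"{ratio}:1")  # 4.5:1
```

A higher ratio allows more GPUs per unit of BAG capacity, while a lower ratio leaves more headroom for cross-region traffic.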

**Hardware and Routing**

BAG utilizes a modular chassis with Jericho3 ASIC line cards, providing scalable interconnects. Routing within BAG uses eBGP with link bandwidth attributes for efficient load balancing. BAG-to-BAG connections are secured with MACsec.
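To show how link bandwidth attributes can drive load balancing, here is a minimal sketch of weighted (unequal-cost) path selection. It assumes each eBGP next hop advertises an integer bandwidth in Gbps and that flows are hashed onto next hops in proportion to that bandwidth; the next-hop names, the flow key format, and the hashing scheme are assumptions for illustration and do not describe the actual switch implementation.

```python
import hashlib


def path_weights(paths):
    """Normalize advertised bandwidths (Gbps) into traffic shares."""
    total = sum(paths.values())
    return {next_hop: bw / total for next_hop, bw in paths.items()}


def pick_next_hop(flow_key, paths):
    """Deterministically map a flow to a next hop, with probability
    proportional to the advertised (integer) bandwidth."""
    total = sum(paths.values())
    digest = hashlib.sha256(flow_key.encode()).digest()
    point = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for next_hop in sorted(paths):
        cumulative += paths[next_hop]
        if point < cumulative:
            return next_hop
    raise RuntimeError("unreachable: point is always below total")


# Hypothetical next hops: one peer still has 1600G of capacity, the other
# has lost half of its links and advertises only 800G.
paths = {"bag-peer-1": 1600, "bag-peer-2": 800}
print(path_weights(paths))
print(pick_next_hop("10.0.0.1:4791->10.0.1.1:4791", paths))
```

Because the weights follow the advertised bandwidth, a peer that loses links automatically attracts a smaller share of flows instead of being overloaded by equal-cost hashing.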

**Designing the Network for Resilience**

To keep availability high and limit the impact of failures, the design specifies port striping and IP addressing schemes and includes a failure domain analysis. When a failure does occur, strategies such as draining the affected BAG planes reduce the risk of blackholing traffic.
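The draining idea can be sketched as a simple reachability check: a plane that has lost its links to a destination it is still expected to serve is taken out of rotation so that traffic shifts to the remaining planes. The plane names, region names, and the rule "drain any plane that cannot reach every required destination" are simplifying assumptions, not the operational policy.

```python
def classify_planes(plane_reach, required):
    """Split planes into (healthy, drain) lists.

    plane_reach: dict mapping plane name -> set of destinations it can reach
    required:    set of destinations every in-service plane must reach
    """
    healthy, drain = [], []
    for plane in sorted(plane_reach):
        if required <= plane_reach[plane]:
            healthy.append(plane)
        else:
            # Keeping this plane in service could blackhole traffic
            # toward the destinations it can no longer reach.
            drain.append(plane)
    return healthy, drain


# Hypothetical state: plane-2 has lost its links toward region-b.
reach = {
    "plane-1": {"region-a", "region-b"},
    "plane-2": {"region-a"},
    "plane-3": {"region-a", "region-b"},
}
healthy, drain = classify_planes(reach, required={"region-a", "region-b"})
print("in service:", healthy)
print("to drain:", drain)
```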

**Considerations for Long Cable Distances**

BAG’s distributed architecture keeps distances to the L2 edge short, which is important for shallow-buffer NSF switches. Longer BAG-to-BAG distances require deep-buffer switches that can support lossless mechanisms such as Priority Flow Control (PFC) over those spans.
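A rough calculation shows why distance drives buffer requirements. After a receiver sends a PFC pause, data already in flight keeps arriving for about one round-trip time, so the receiver needs headroom of roughly link rate times round-trip delay. The sketch below assumes about 5 µs of propagation delay per kilometer of fiber and ignores switch response time and frames in transmission; the link speed and distances are hypothetical.

```python
FIBER_DELAY_US_PER_KM = 5.0  # approximate propagation delay in optical fiber


def pfc_headroom_bytes(link_gbps, distance_km):
    """Approximate bytes in flight during one round trip on the link.

    A simplified bandwidth-delay estimate, not a vendor sizing formula.
    """
    rtt_seconds = 2 * distance_km * FIBER_DELAY_US_PER_KM * 1e-6
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * rtt_seconds


# Hypothetical 800G link at a short in-building span and a long regional span
for km in (0.1, 50):
    mb = pfc_headroom_bytes(800, km) / 1e6
    print(f"{km} km -> about {mb:.1f} MB of headroom per lossless queue")
```

Under these assumptions the required headroom grows linearly with distance, from roughly a hundred kilobytes at 100 m to tens of megabytes at 50 km, which is why short L2-to-BAG runs suit shallow-buffer switches while long BAG-to-BAG spans need deep buffers.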

**Building Prometheus and Beyond**

BAG plays a crucial role in Meta’s next-generation AI infrastructure, providing the networking that makes the gigawatt-scale Prometheus cluster possible. Built on modular hardware and resilient topologies, the design lets BAG adapt to Prometheus’s demands and scale with Meta’s global AI network.
