The performance of a distributed ML training cluster is bounded by two resources: compute and communication. While GPU compute has received enormous attention, networking infrastructure is equally critical and often less well understood. In a large training run spanning hundreds or thousands of GPUs, collective communication operations — all-reduce, all-gather, reduce-scatter — can consume 20-40% of training time if the network is not properly designed and configured. Getting networking right is not optional; it is a prerequisite for cluster-scale efficiency.
This article provides a thorough examination of the networking technologies and design principles that underpin high-performance AI clusters. We cover the hardware layer (InfiniBand, RoCE, NVLink), network topology choices, the RDMA transport protocol stack, collective communication algorithms, and practical considerations for achieving the bandwidth utilization that distributed training demands.
The Two Levels of AI Cluster Networking
AI cluster networking operates at two distinct levels with very different characteristics: intra-node GPU-to-GPU communication, and inter-node communication across the cluster fabric. Understanding this hierarchy is essential because the optimal networking technology, topology, and software stack differ significantly between them.
Intra-node communication is handled by NVLink, NVIDIA's proprietary high-speed GPU interconnect. NVLink 4.0 (used in H100 systems) provides 900 GB/s of total bidirectional bandwidth across 18 links, roughly 7x the 128 GB/s bidirectional bandwidth of a PCIe 5.0 x16 link. Within an HGX H100 server, NVSwitches provide full mesh all-to-all connectivity among all 8 GPUs, meaning any GPU can communicate with any other GPU at full NVLink bandwidth. This makes intra-node collective operations (all-reduce across 8 GPUs) extremely fast — a 10 GB tensor can be all-reduced in under 100 milliseconds at NVLink speeds. The implication for parallelism strategy is that tensor parallelism and pipeline parallelism, which require high-bandwidth low-latency communication, should be confined within a single NVSwitch domain wherever possible.
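As a back-of-envelope check on that figure: a ring all-reduce moves 2(N-1)/N of the tensor through each GPU's links. The sketch below assumes roughly 450 GB/s usable per direction (half the 900 GB/s bidirectional number) — an assumption for illustration, not a measured value.

```python
def ring_allreduce_time_s(tensor_bytes: float, n_gpus: int, gb_s_per_dir: float) -> float:
    # Ring all-reduce moves 2*(N-1)/N of the tensor through each GPU's
    # links: a reduce-scatter phase followed by an all-gather phase.
    bytes_per_gpu = 2 * (n_gpus - 1) / n_gpus * tensor_bytes
    return bytes_per_gpu / (gb_s_per_dir * 1e9)

# 10 GB tensor across 8 GPUs at ~450 GB/s per direction: ~39 ms,
# comfortably under the 100 ms figure quoted above.
t = ring_allreduce_time_s(10e9, 8, 450)
```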
Inter-node communication uses the cluster's interconnect fabric — typically InfiniBand or RoCE (RDMA over Converged Ethernet). The performance characteristics are fundamentally different: InfiniBand HDR (200 Gb/s per link) delivers approximately 25 GB/s per port, compared to NVLink's 900 GB/s aggregate. Inter-node bandwidth is the primary bottleneck for large-scale training, and the network topology determines whether this bandwidth scales linearly with cluster size or is bottlenecked by oversubscription.
InfiniBand Architecture and Performance
InfiniBand has been the dominant interconnect for HPC and increasingly for AI clusters due to its ultra-low latency (sub-microsecond hardware latency), high bandwidth, and mature RDMA (Remote Direct Memory Access) implementation. RDMA allows GPUs to read from and write to remote GPU memory without involving the remote host's CPU, dramatically reducing latency and CPU overhead for collective operations.
InfiniBand's transport protocol is implemented in hardware and exposed to applications through the RDMA Verbs interface. Applications use the libibverbs API to post send and receive work requests directly to queue pairs on the InfiniBand host channel adapter (HCA). The HCA handles retransmission, flow control, and path management in hardware, keeping the communication latency in the 1-2 microsecond range for small messages. For large messages, the bandwidth-delay product of a 200 Gb/s link at 2 microseconds is approximately 50 KB — meaning message sizes above this threshold are bandwidth-bound rather than latency-bound.
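The bandwidth-delay product threshold is simple to reproduce:

```python
def bdp_bytes(link_gbit_s: float, latency_s: float) -> float:
    # Bandwidth-delay product: the data "in flight" on the link.
    # Messages larger than this are bandwidth-bound; smaller ones
    # are dominated by latency.
    return link_gbit_s * 1e9 / 8 * latency_s

bdp = bdp_bytes(200, 2e-6)  # HDR link at 2 us -> ~50 KB
```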
Current InfiniBand generations include EDR (100 Gb/s), HDR (200 Gb/s), and NDR (400 Gb/s, shipping since 2022-2023). NVIDIA's acquisition of Mellanox (now NVIDIA Networking) has accelerated InfiniBand roadmap alignment with GPU architecture. The DGX H100 SuperPOD reference architecture uses 400 Gb/s NDR InfiniBand with one 400 Gb/s port per GPU, providing 50 GB/s per GPU of inter-node bandwidth. For a 512-GPU SuperPOD cluster, this means an aggregate inter-node bisection bandwidth of approximately 12.8 TB/s, assuming a non-blocking fabric.
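The bisection figure follows directly from the per-GPU port bandwidth:

```python
def full_bisection_tb_s(n_gpus: int, per_gpu_gb_s: float) -> float:
    # Full bisection bandwidth: each GPU in one half of the cluster
    # drives its full injection bandwidth to the other half.
    return n_gpus / 2 * per_gpu_gb_s / 1e3

bb = full_bisection_tb_s(512, 50)  # 12.8 TB/s for a 512-GPU SuperPOD
```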
RoCE: RDMA over Converged Ethernet
RoCE (RDMA over Converged Ethernet) provides RDMA semantics over standard Ethernet infrastructure, offering a potentially less expensive alternative to InfiniBand. RoCEv2 operates over IP-routed Ethernet and is the version deployed in hyperscale AI clusters at companies like Meta and Microsoft Azure (Meta's earlier Research SuperCluster, by contrast, used InfiniBand). RoCEv2 requires Priority Flow Control (PFC), part of the Data Center Bridging (DCB) standards, to prevent packet loss, as RDMA's reliability mechanisms assume a lossless transport — dropped packets cause timeouts and retransmissions that severely degrade performance.
The practical performance difference between well-configured 400 GbE RoCEv2 and InfiniBand NDR is modest at the hardware level — both offer similar bandwidth and RDMA latency (RoCE adds ~1 microsecond vs InfiniBand for small messages due to the IP/UDP encapsulation overhead). The more significant differences are operational: InfiniBand uses subnet managers for network configuration and path management, which simplifies some aspects of cluster management but requires InfiniBand expertise. RoCE's standard Ethernet substrate is more familiar to network engineers and reuses commodity switching infrastructure, but requires careful PFC configuration and QoS policies to avoid congestion spreading and ensure lossless delivery.
Congestion control is critical for RoCE performance at scale. ECN-based DCQCN (Data Center Quantized Congestion Notification) is the standard algorithm, adapting injection rates based on explicit congestion notifications from switches. Without effective congestion control, many-to-one communication patterns (common in all-reduce operations) cause incast congestion at target switches, degrading collective performance significantly. Hyperscalers typically tune DCQCN parameters extensively for their specific workloads and topologies.
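The decrease/recovery dynamic is easier to see in code. The toy model below is a heavily simplified sketch of DCQCN's sender-side rate control under stated assumptions: real DCQCN also runs alpha-update timers, byte counters, and hyper-increase stages, and the parameter values here are illustrative, not tuned.

```python
class DcqcnRateSketch:
    """Toy model of DCQCN sender-side rate adaptation (simplified)."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16, r_ai_gbps: float = 5.0):
        self.line = line_rate_gbps
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # estimate of congestion severity
        self.g = g                 # EWMA gain for alpha updates
        self.r_ai = r_ai_gbps      # additive-increase step

    def on_cnp(self) -> None:
        # Congestion Notification Packet arrived: remember the current
        # rate as the recovery target, then cut multiplicatively.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2

    def on_quiet_period(self) -> None:
        # No CNPs for a while: decay the congestion estimate,
        # nudge the target up, and fast-recover halfway toward it.
        self.alpha *= 1 - self.g
        self.rt = min(self.rt + self.r_ai, self.line)
        self.rc = (self.rt + self.rc) / 2
```

One CNP halves the rate of a fully congested flow; subsequent quiet periods close half the remaining gap to the target each time, which is why tuning the alpha gain and increase step per workload matters so much at scale.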
Network Topology: Fat-Tree and Rail-Optimized Designs
The physical arrangement of switches and cables — the network topology — determines the maximum bisection bandwidth available between arbitrary pairs of nodes and the cost of achieving it. For AI clusters, two topologies dominate: fat-tree (folded Clos) and rail-optimized, the design used in NVIDIA's DGX SuperPOD reference architectures.
Fat-tree networks provide full bisection bandwidth between any pair of nodes — the bandwidth available from any half of the cluster to the other half equals the total injection bandwidth. A two-tier fat-tree built with 64-port NDR switches can connect up to 2048 GPUs without blocking, with every GPU having a full 400 Gb/s path to every other GPU. Fat-tree is the traditional choice for general-purpose HPC clusters where any communication pattern must be accommodated. The cost scales with switch port count: a non-blocking 1024-GPU fat-tree built from 64-port switches needs 48 switches (32 leaf, 16 spine) and 1024 inter-switch links, and oversubscribing the leaf uplinks trades bisection bandwidth for fewer spine switches and cables.
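Under the standard leaf-spine accounting — half of each leaf's ports face endpoints, half face spine uplinks — the switch counts fall out of a few lines of arithmetic:

```python
def two_tier_fat_tree(n_endpoints: int, ports: int = 64) -> tuple:
    # Non-blocking two-tier (leaf-spine) fat-tree: each leaf dedicates
    # half its ports to endpoints and half to uplinks, so the maximum
    # capacity is ports**2 / 2 endpoints.
    assert n_endpoints <= ports * ports // 2, "needs a third tier"
    leaves = -(-n_endpoints // (ports // 2))        # ceiling division
    spines = -(-(leaves * (ports // 2)) // ports)   # spread uplinks over spines
    return leaves, spines

leaves, spines = two_tier_fat_tree(1024)  # 32 leaf + 16 spine switches
```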
Rail-optimized topology co-designs the network with the GPU parallelism strategy. In a typical configuration, each GPU connects to a different "rail" — a dedicated switch serving GPUs of the same rank across all nodes. In 8-GPU-per-node clusters, GPU 0 from every node connects to rail switch 0, GPU 1 connects to rail switch 1, and so on. Because inter-node ring-allreduce traffic flows between same-indexed GPUs on different nodes, each ring stays on a single rail switch, minimizing inter-switch hops and congestion. Meta's research has shown rail-optimized designs can achieve substantially higher collective bandwidth utilization than equivalent fat-tree designs for standard distributed training workloads.
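The rail property is easy to verify in code: with ranks assigned node-major, a GPU's rail is just its local index, and same-rail peers on other nodes differ by a multiple of the node size. A small sketch, assuming 8 GPUs per node:

```python
GPUS_PER_NODE = 8

def rail_of(global_rank: int) -> int:
    # GPU i on every node attaches its NIC to rail switch i, so a
    # rank's rail is simply its local GPU index.
    return global_rank % GPUS_PER_NODE

# The same-local-index GPU on the next node is GPUS_PER_NODE ranks
# away -- and it sits on the same rail, so an inter-node ring built
# from such pairs never crosses rail switches.
world = 32  # 4 nodes x 8 GPUs
same_rail = all(
    rail_of(r) == rail_of((r + GPUS_PER_NODE) % world) for r in range(world)
)
```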
Collective Communication Algorithms and NCCL
NCCL (NVIDIA Collective Communications Library) is the standard library for GPU collective operations and is tightly integrated with PyTorch's distributed training backends. NCCL implements all-reduce, broadcast, reduce, all-gather, and reduce-scatter using algorithms adapted to the topology detected by the library at initialization. Understanding NCCL's algorithm selection helps diagnose performance issues and design clusters that achieve high collective bandwidth utilization.
For intra-node all-reduce across 8 GPUs connected by NVSwitch, NCCL uses ring algorithms over the NVSwitch fabric (and, on Hopper-generation NVSwitch, in-switch NVLS reductions), achieving close to the theoretical peak bandwidth. For inter-node all-reduce, NCCL implements ring-allreduce: each GPU sends data chunks to its successor in the ring, first accumulating partial sums (a reduce-scatter phase) and then circulating the reduced results (an all-gather phase). Ring-allreduce is bandwidth-optimal for large messages: each GPU transfers 2(N-1)/N times the tensor size, so achieved bandwidth approaches the per-GPU link limit as N grows. The disadvantage is latency: ring-allreduce completes in 2(N-1) steps, making it latency-bound for small tensors. NCCL's "double binary tree" algorithm improves small-message latency at the cost of slightly lower bandwidth utilization for large messages.
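This 2(N-1)/N factor is why nccl-tests reports a "bus bandwidth" metric alongside the raw algorithm bandwidth: busbw normalizes for the data a ring actually moves, so a perfect ring reports busbw equal to the per-GPU link bandwidth. A sketch of the conversion:

```python
def allreduce_busbw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    # nccl-tests convention for all-reduce: bus bandwidth is the
    # algorithm bandwidth (size / time) scaled by 2*(N-1)/N, the
    # volume a ring actually moves per GPU.
    algbw = size_bytes / time_s
    return algbw * 2 * (n_ranks - 1) / n_ranks

# 1 GB all-reduced across 8 ranks in 50 ms:
# algbw = 20 GB/s, busbw = 35 GB/s.
bw = allreduce_busbw(1e9, 0.05, 8)
```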
NCCL topology files allow explicit specification of the cluster's physical topology, enabling NCCL to choose ring orderings and tree structures that minimize intra-switch communication. For rail-optimized clusters, providing NCCL with accurate topology information can improve collective bandwidth by 20-30% by ensuring ring members are served by the same rail switch. Profiling NCCL performance with nccl-tests is an essential step in cluster commissioning — it reveals whether the network is achieving theoretical bandwidth and where bottlenecks exist.
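As a concrete starting point, the environment variable names below are real NCCL knobs, while the topology file path is a hypothetical placeholder for a site-specific file; NCCL reads these at initialization, so they must be set before the first collective runs.

```python
import os

# NCCL_TOPO_FILE, NCCL_DEBUG, and NCCL_DEBUG_SUBSYS are real NCCL
# environment variables; the path below is a placeholder.
os.environ["NCCL_TOPO_FILE"] = "/etc/nccl/cluster-topo.xml"  # explicit physical topology
os.environ["NCCL_DEBUG"] = "INFO"          # log ring/tree construction at init
os.environ["NCCL_DEBUG_SUBSYS"] = "GRAPH"  # show the topology search results
```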
Storage Networking for Training Data
The network must serve not only GPU-to-GPU collective traffic but also the high-throughput data ingestion that keeps GPUs fed during training. For large-scale training with large datasets (hundreds of TB to PB of training data), the storage network must provide aggregate read bandwidth sufficient to keep all GPUs busy during data preprocessing and loading. A 512-GPU cluster may require 50-100 GB/s of aggregate storage read bandwidth for data loading, depending on sample sizes and the preprocessing pipeline.
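Sizing that bandwidth is a simple product of GPU count, per-GPU sample throughput, and sample size; the numbers below are illustrative, not measurements.

```python
def storage_read_bw_gb_s(n_gpus: int, samples_per_gpu_s: float, bytes_per_sample: float) -> float:
    # Aggregate read bandwidth needed to keep every GPU's dataloader fed.
    return n_gpus * samples_per_gpu_s * bytes_per_sample / 1e9

# Illustrative: 512 GPUs, 20 samples/s/GPU, 10 MB samples -> ~102 GB/s.
bw = storage_read_bw_gb_s(512, 20, 10e6)
```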
Parallel file systems (Lustre, GPFS, BeeGFS) scale storage bandwidth by striping data across many storage servers, each with dedicated network connections. Lustre is the most widely deployed in HPC and AI clusters, with a mature ecosystem and support for RDMA-based data transfer. High-speed all-flash storage arrays with NVMe SSDs provide the IOPS and bandwidth density needed for random-access workloads like reinforcement learning from human feedback (RLHF) training with diverse replay buffers. Proper storage networking design — dedicated storage VLANs or separate fabrics, appropriate MTU settings for storage traffic, and storage-specific QoS policies — prevents storage I/O from interfering with inter-GPU collective traffic.
Key Takeaways
- AI cluster networking operates at two tiers: NVLink/NVSwitch for intra-node GPU communication (900 GB/s) and InfiniBand/RoCE for inter-node (200-400 Gb/s per port).
- Parallelism strategies must match network topology: tensor and pipeline parallelism within NVSwitch domains, data and ZeRO parallelism across nodes via InfiniBand.
- InfiniBand provides mature RDMA with sub-microsecond latency; RoCEv2 on 400 GbE can match InfiniBand performance but requires careful PFC and congestion control configuration.
- Rail-optimized topologies outperform generic fat-tree for standard distributed training collective patterns by reducing inter-switch traffic and congestion.
- NCCL topology files are essential for clusters with non-trivial topologies — providing accurate topology information can improve collective bandwidth by 20-30%.
- Storage networking must be sized separately from GPU fabric: parallel file systems with RDMA support are required to sustain data ingestion at cluster scale.
Conclusion
Networking is as fundamental to AI cluster performance as the GPUs themselves. The gap between a poorly networked cluster where collective operations bottleneck training efficiency and a well-designed fabric that approaches the theoretical collective bandwidth limit can be the difference between a competitive and an uncompetitive infrastructure investment. The decisions made at the networking layer — interconnect technology, topology, congestion control, NCCL configuration — have compounding effects across the entire training workload, making networking expertise an essential component of the ML infrastructure skill set.