A 70-billion-parameter LLaMA model requires approximately 140 GB of GPU memory in BF16 precision just to store its parameters — before accounting for activations, optimizer states, or gradients. A single H100 GPU with 80 GB of HBM3 can hold barely more than half of those parameters, and none of the training state. The solution is tensor parallelism: splitting the individual tensor operations that constitute the model across multiple GPUs, each holding a shard of the weight matrices and computing partial results that are then combined.
Tensor parallelism is one of the most technically nuanced distributed training techniques because it requires modifying the computational graph itself — not just distributing whole layers (as in pipeline parallelism) or whole model copies (as in data parallelism), but splitting the computations within individual layers. This article provides a rigorous technical explanation of how tensor parallelism works, why the communication patterns matter enormously, and how modern frameworks implement it efficiently.
The Fundamental Idea: Column and Row Splitting
The core insight behind tensor parallelism is that large matrix multiplications can be split across GPUs in two ways: column-parallel (splitting the weight matrix by columns) or row-parallel (splitting by rows). These two approaches have different communication requirements, and they are designed to be used together in sequence to minimize the number of synchronization points.
Consider a simple linear layer: Y = XA, where X is the input activation matrix and A is the weight matrix. In column-parallel mode, A is split column-wise across N GPUs: A = [A₁ | A₂ | ... | Aₙ]. Each GPU i computes Yᵢ = X·Aᵢ — a shard of the output columns. Every GPU needs the full input X, but because X is replicated across the tensor-parallel group in the standard scheme, the forward pass requires no communication (the backward pass pays for this with an all-reduce on the input gradient). The output shards Yᵢ are left in place rather than gathered — concatenating them across GPUs would cost an all-gather — and are consumed directly by the next layer.
In row-parallel mode, A is split row-wise: A = [A₁; A₂; ...; Aₙ] (stacked vertically). The input X must be split column-wise to match: Xᵢ is the shard of X assigned to GPU i. Each GPU computes a partial sum Yᵢ = Xᵢ·Aᵢ, and the final result Y = ΣYᵢ requires an all-reduce across all GPUs. The forward pass therefore costs one all-reduce where column-parallel costs none — but a row-parallel layer pairs naturally as the second layer after a column-parallel one, because the column shards produced by the first layer are exactly the input shards the second layer consumes.
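The two decompositions can be verified in a few lines of NumPy, with array slices standing in for per-GPU shards (the shapes and the tensor-parallel degree N are arbitrary illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                               # tensor-parallel degree
X = rng.standard_normal((8, 16))    # input activations (tokens x d_in)
A = rng.standard_normal((16, 32))   # weight matrix

# Column-parallel: split A by columns; each "GPU" computes a slice of Y's columns.
A_cols = np.split(A, N, axis=1)
Y_col_shards = [X @ A_i for A_i in A_cols]    # no cross-GPU communication
Y_col = np.concatenate(Y_col_shards, axis=1)  # shards stay local in practice

# Row-parallel: split A by rows and X by columns; each "GPU" holds a partial sum.
A_rows = np.split(A, N, axis=0)
X_cols = np.split(X, N, axis=1)
Y_row_partials = [X_i @ A_i for X_i, A_i in zip(X_cols, A_rows)]
Y_row = sum(Y_row_partials)                   # this sum is the all-reduce

assert np.allclose(Y_col, X @ A)
assert np.allclose(Y_row, X @ A)
```

Both paths reproduce the unsharded product; the difference is purely where the communication lands.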
Megatron-LM's Transformer Parallelism Strategy
The seminal work on efficient tensor parallelism for transformers is Megatron-LM from NVIDIA, which identified an elegant pattern for splitting transformer blocks that requires only two all-reduce operations per transformer layer in the forward pass and two in the backward pass — four total, regardless of tensor-parallel degree.
For the MLP sublayer of a transformer: the first linear layer (expansion from d_model to 4*d_model) is implemented as column-parallel; the second linear layer (contraction back to d_model) is row-parallel. In the forward pass: the column-parallel first layer takes the full input and produces partial outputs without communication. The GELU activation is applied locally. The row-parallel second layer takes the partial inputs and produces partial sums that are all-reduced to produce the final output. Total: one all-reduce per MLP sublayer.
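The whole MLP pattern — column-parallel expansion, local GELU, row-parallel contraction, one summed all-reduce — can be sketched the same way (dimensions here are small illustrative stand-ins, and the tanh GELU approximation is used for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_model = 4, 64
X = rng.standard_normal((8, d_model))             # input, replicated on all "GPUs"
W1 = rng.standard_normal((d_model, 4 * d_model))  # expansion: column-parallel
W2 = rng.standard_normal((4 * d_model, d_model))  # contraction: row-parallel

def gelu(x):
    # tanh approximation of GELU, applied elementwise (so it shards cleanly)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1_shards = np.split(W1, N, axis=1)   # column shards of the first layer
W2_shards = np.split(W2, N, axis=0)   # matching row shards of the second layer

# Each "GPU" i computes gelu(X @ W1_i) @ W2_i with no communication at all;
# summing the partials is the single all-reduce of the MLP sublayer.
partials = [gelu(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
Y = sum(partials)

Y_ref = gelu(X @ W1) @ W2             # unsharded reference
assert np.allclose(Y, Y_ref)
```

The key property making this work is that GELU is elementwise: applying it to a column shard of the intermediate activation gives exactly the corresponding shard of the full activation, so no communication is needed between the two linear layers.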
For the attention sublayer: Q, K, and V projection matrices are split column-parallel along the attention head dimension. Each GPU computes attention for its subset of heads. The output projection is row-parallel. This distributes attention heads across GPUs naturally, with each GPU computing full attention for its assigned heads. Total: one all-reduce per attention sublayer. Summing across both sublayers: two all-reduce operations per transformer layer in the forward pass.
Communication Volume and Bandwidth Requirements
Understanding the communication volume of tensor parallelism is essential for evaluating its feasibility at different network bandwidths. For a transformer with hidden dimension d_model and a tensor-parallel degree N, each all-reduce on the output of a sublayer communicates a tensor of size (batch_size × seq_len × d_model) values. In BF16, this is 2 × batch_size × seq_len × d_model bytes per all-reduce.
For a typical training configuration — batch size 4, sequence length 2048, d_model 8192 (a 70B-class model) — each all-reduce communicates approximately 135 MB. With 2 all-reduces per layer and 80 layers, that is 21.6 GB per forward pass. The ring-allreduce algorithm sends 2(N-1)/N × data volume per GPU (approximately 2× the tensor size for large N). At the H100's NVLink bandwidth of 900 GB/s, the all-reduce for a single sublayer completes in ~0.3 ms — fast enough that tensor parallelism within an NVSwitch domain has minimal overhead.
The numbers change dramatically if tensor parallelism crosses node boundaries over InfiniBand. At 200 Gb/s (25 GB/s) InfiniBand HDR, the same ring all-reduce on a 135 MB tensor takes ~10.7 ms — 36x slower. With 320 all-reduces per forward-backward pass (two forward and two backward per layer, across 80 layers), this adds roughly 3.4 seconds of pure communication overhead per step, making inter-node tensor parallelism economically unviable for typical training workloads. This is why tensor parallelism is effectively limited to within a single NVSwitch domain (4 or 8 GPUs) in practice.
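The bandwidth arithmetic is easy to reproduce. The sketch below uses the configuration from the text (batch 4, sequence 2048, d_model 8192, 80 layers, BF16) and the ring-allreduce approximation of ~2x the tensor size on the wire per GPU:

```python
bytes_per_val = 2                        # BF16
batch, seq, d_model, layers = 4, 2048, 8192, 80

tensor_bytes = bytes_per_val * batch * seq * d_model
print(f"per all-reduce: {tensor_bytes / 1e6:.0f} MB")   # -> per all-reduce: 134 MB

wire_bytes = 2 * tensor_bytes            # ring all-reduce, large N
nvlink_ms = wire_bytes / 900e9 * 1e3     # H100 NVLink: 900 GB/s
ib_ms = wire_bytes / 25e9 * 1e3          # HDR InfiniBand: 25 GB/s
print(f"NVLink: {nvlink_ms:.2f} ms, InfiniBand: {ib_ms:.1f} ms")

allreduces = 4 * layers                  # 2 forward + 2 backward per layer
print(f"IB communication per step: {allreduces * ib_ms / 1e3:.1f} s")
```

Running this confirms the ~0.3 ms NVLink and ~10.7 ms InfiniBand per-collective times, and a total of about 3.4 seconds of InfiniBand communication per training step.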
Expert Parallelism in Mixture-of-Experts Models
Mixture-of-Experts (MoE) architectures like Mixtral and GPT-4 (reportedly) use a different form of model parallelism. In MoE, the dense MLP sublayer is replaced by N expert MLPs, where each token is routed to a small subset of experts (typically 2 out of 8-64 experts) by a learned routing mechanism. Expert parallelism assigns different experts to different GPUs, enabling models with dramatically more parameters than a dense model of the same computational cost.
The communication pattern in expert parallelism is an all-to-all collective: after routing decisions are made, each GPU must send tokens to the GPUs that hold the designated experts, and receive tokens from other GPUs that routed to its local experts. This all-to-all communication pattern is more complex than the all-reduce in standard tensor parallelism and requires careful load balancing to avoid expert capacity bottlenecks. Auxiliary loss terms are commonly added during training to encourage uniform routing across experts.
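A toy routing step makes the all-to-all shape of the problem concrete. Here the router is a random matrix rather than a trained gate, and each expert is assumed to live on its own GPU — both are illustrative assumptions, not details from any particular MoE implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
tokens, d_model, n_experts, top_k = 16, 8, 4, 2
X = rng.standard_normal((tokens, d_model))
gate = rng.standard_normal((d_model, n_experts))   # a learned router in practice

logits = X @ gate
chosen = np.argsort(-logits, axis=1)[:, :top_k]    # top-k expert ids per token

# The all-to-all: each "GPU" (one expert per GPU here) must receive every token
# routed to its expert. The per-expert counts below are what auxiliary
# load-balancing losses try to keep uniform.
counts = np.bincount(chosen.ravel(), minlength=n_experts)
print("tokens routed to each expert:", counts)
assert counts.sum() == tokens * top_k
```

With a random router the counts are typically skewed — which is exactly the expert-capacity imbalance that the auxiliary loss terms mentioned above are meant to correct.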
Flash Attention and Memory-Efficient Attention
Vanilla attention has quadratic memory complexity with respect to sequence length — the attention matrix for a sequence of length L requires O(L²) memory. For a 2K-token sequence, this is manageable, but for 32K or longer contexts it becomes prohibitive. Flash Attention (Dao et al., 2022) restructures the attention computation to avoid materializing the full L×L attention matrix in GPU HBM by using tiled computation and SRAM-level accumulators.
Flash Attention's memory footprint is O(L) rather than O(L²), and because it reduces HBM reads and writes significantly, it is typically 2-4x faster than naive attention for long sequences. Flash Attention v2 and v3 improve GPU utilization further through better parallelism and overlap of compute and memory operations. For any production training or inference system targeting sequences above 2K tokens, Flash Attention is effectively mandatory — the performance and memory benefits are too large to ignore.
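The core trick — an online softmax over key/value tiles, so the full L×L score matrix is never materialized — can be demonstrated in NumPy. This is only a sketch of the algorithmic idea under simplified assumptions: real Flash Attention kernels also tile over queries, fuse everything into a single GPU kernel, and keep tiles in SRAM.

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, block = 64, 16, 16
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
scale = 1 / np.sqrt(d)

m = np.full((L, 1), -np.inf)   # running row-wise max (for numerical stability)
l = np.zeros((L, 1))           # running softmax denominator
O = np.zeros((L, d))           # running unnormalized output

for start in range(0, L, block):
    Kb, Vb = K[start:start + block], V[start:start + block]
    S = (Q @ Kb.T) * scale                        # only an L x block score tile
    m_new = np.maximum(m, S.max(axis=1, keepdims=True))
    P = np.exp(S - m_new)
    correction = np.exp(m - m_new)                # rescale earlier accumulators
    l = l * correction + P.sum(axis=1, keepdims=True)
    O = O * correction + P @ Vb
    m = m_new

out = O / l

# Reference: naive attention with the full L x L matrix materialized.
S_full = (Q @ K.T) * scale
P_full = np.exp(S_full - S_full.max(axis=1, keepdims=True))
ref = (P_full / P_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(out, ref)
```

At no point does the tiled loop hold more than an L×block slice of scores, which is why the extra memory is O(L) in sequence length rather than O(L²).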
Key Takeaways
- Tensor parallelism splits individual matrix operations across GPUs using column-parallel and row-parallel decompositions, requiring only two all-reduce operations per transformer layer in the forward pass (and two more in the backward pass).
- The Megatron-LM pattern is the standard for efficient transformer tensor parallelism, splitting MLP and attention sublayers to minimize communication.
- Tensor parallelism is effective only within a single NVSwitch domain (4-8 GPUs) — the communication overhead of inter-node TP over InfiniBand is prohibitive for most workloads.
- Communication volume scales with sequence length, batch size, and hidden dimension — working through this arithmetic tells you whether a given interconnect can sustain a chosen tensor-parallel degree.
- Mixture-of-Experts uses expert parallelism with all-to-all communication, enabling trillion-parameter models at reasonable compute cost.
- Flash Attention is essential for long-context training, reducing memory complexity from O(L²) to O(L) while also improving throughput.
Conclusion
Tensor parallelism is one of the most technically elegant techniques in distributed ML, enabling models that simply cannot fit on a single device by splitting computations at the matrix level. The key insight is that carefully chosen column-parallel and row-parallel decompositions of transformer sublayers minimize communication to just two all-reduce operations per layer — keeping synchronization overhead manageable at NVLink bandwidths. As models continue to scale and context lengths grow, understanding the communication mathematics of tensor parallelism becomes essential for ML infrastructure engineers designing efficient training configurations.