Serving large language models in production is a fundamentally different engineering challenge from training them. While training is primarily a throughput optimization problem — maximizing tokens processed per second across a distributed cluster — inference must simultaneously optimize for latency, throughput, cost, and reliability under variable load. This article surveys the key techniques that production LLM serving teams use to meet these competing demands.
Understanding the LLM Serving Performance Profile
LLM inference consists of two distinct computational phases with very different characteristics. The prefill phase processes the input prompt and generates the initial KV cache — it is highly parallelizable and compute-bound. The decode phase generates output tokens one at a time, reading the KV cache and running a forward pass for each generated token — it is memory-bandwidth-bound and inherently sequential.
This decomposition has important practical implications. The first-token latency (time-to-first-token, TTFT) is dominated by prefill computation. Subsequent token latency (inter-token latency, ITL) is dominated by KV cache reads. Optimizing these two phases requires different techniques.
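A rough roofline estimate makes the decode phase's bandwidth limit concrete: each decode step must stream the model weights (plus the request's KV cache) from HBM, so inter-token latency is bounded below by bytes moved divided by memory bandwidth. The figures below — a 70B FP16 model, H100-class ~3.35 TB/s bandwidth, a ~1.3 GB KV cache — are illustrative assumptions, and the estimate ignores compute and kernel-launch overhead:

```python
# Lower bound on inter-token latency for a bandwidth-bound decode step:
# every step reads the full weights plus this request's KV cache from HBM.

def decode_step_ms(param_count, bytes_per_param, kv_cache_bytes, mem_bw_bytes_per_s):
    bytes_read = param_count * bytes_per_param + kv_cache_bytes
    return 1000 * bytes_read / mem_bw_bytes_per_s

itl = decode_step_ms(
    param_count=70e9,
    bytes_per_param=2,           # FP16
    kv_cache_bytes=1.3e9,        # one 4K-context request
    mem_bw_bytes_per_s=3.35e12,  # H100 SXM HBM3 bandwidth (spec sheet figure)
)
print(f"~{itl:.1f} ms per token")  # → ~42.2 ms per token (a lower bound)
```

This also shows why quantization and batching help decode so much: halving the bytes per parameter, or amortizing one weight read across many batched sequences, directly cuts this bound.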
Continuous Batching
Traditional static batching for inference groups multiple requests into a single batch and waits for all sequences in the batch to complete before starting new requests. This is highly inefficient when sequence lengths vary significantly, as GPU resources sit idle waiting for the longest sequence in each batch to finish.
Continuous batching (also called iteration-level scheduling or in-flight batching) addresses this by decoupling request handling from batch boundaries. At each decode iteration, new requests can be inserted into the batch as soon as slots become available from completing sequences. This dramatically improves GPU utilization under variable-length workloads and is now the standard approach in production LLM serving systems including vLLM, TensorRT-LLM, and TGI.
The throughput improvement from continuous batching over static batching can be substantial — benchmarks on Llama-2 70B show 3-5x throughput improvements under typical enterprise request distributions, which tend to have highly variable input and output lengths.
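The scheduling idea can be shown with a toy simulator: at every iteration boundary, freed slots are refilled from the queue instead of waiting for the whole batch to drain. The fixed batch size and the one-token-per-iteration model are simplifying assumptions:

```python
from collections import deque

# Toy iteration-level scheduler: after every decode step, finished sequences
# free their slots and queued requests are admitted immediately.

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate) pairs."""
    queue = deque(requests)
    active = {}                    # request_id -> tokens still to generate
    completed, steps = [], 0
    while queue or active:
        # Admit queued requests into free slots at the iteration boundary.
        while queue and len(active) < max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decode iteration: every active sequence emits one token.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return completed, steps

done, steps = continuous_batching([("a", 2), ("b", 8), ("c", 3), ("d", 1), ("e", 4)])
print(done, steps)   # → ['d', 'a', 'c', 'e', 'b'] 8
```

A static batcher on the same workload would spend 8 iterations on the first batch of four, then 4 more on the leftover request — 12 iterations versus 8 here, with the gap widening as length variance grows.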
PagedAttention and KV Cache Management
The KV cache stores the key and value projections for every token in the context window of each active request. For a GQA model like Llama-2 70B, a single request's 4K-context KV cache consumes approximately 1.3 GB of GPU memory, so 60 concurrent requests would consume roughly 78 GB — essentially the entire capacity of an 80 GB GPU, before accounting for the ~140 GB of FP16 model weights. Naive contiguous per-request allocation is clearly infeasible at this scale.
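The ~1.3 GB figure can be reproduced from the model's published attention configuration. A short sketch, assuming Llama-2 70B's GQA shape (80 layers, 8 KV heads, head dimension 128) at FP16:

```python
# KV cache footprint per token: 2 (K and V) x layers x kv_heads x head_dim x bytes.
# Defaults below are Llama-2 70B's GQA configuration at FP16 precision.

def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_el=2):
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el
    return seq_len * per_token_bytes / 1e9

print(f"{kv_cache_gb(4096):.2f} GB per 4K-context request")  # → 1.34 GB per 4K-context request
```

Note that without GQA (i.e. 64 full KV heads instead of 8) the same calculation would give over 10 GB per request, which is why grouped-query attention matters so much for serving economics.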
vLLM's PagedAttention mechanism solves this by managing KV cache in fixed-size pages rather than allocating contiguous memory blocks per sequence. Pages are allocated on-demand as sequences grow and can be shared across requests using the same prompt prefix (prefix caching). This reduces KV cache memory fragmentation by up to 90% and enables 5-10x more concurrent requests on the same GPU memory budget compared to naive KV cache management.
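A minimal sketch of the bookkeeping involved — fixed-size pages allocated on demand, with refcounted sharing of complete prefix pages between forked sequences. The page size and the `append_token`/`fork` API shape are illustrative assumptions, not vLLM's actual interface:

```python
# Paged KV cache bookkeeping: logical token positions map to physical pages
# through a per-sequence page table; full prefix pages are shared by refcount.

PAGE_TOKENS = 16

class PagedKVCache:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.refcount = {}          # physical page -> number of sharers
        self.tables = {}            # seq_id -> list of physical pages

    def append_token(self, seq_id, pos):
        """Ensure a physical page backs token index `pos` of `seq_id`."""
        table = self.tables.setdefault(seq_id, [])
        if pos // PAGE_TOKENS >= len(table):   # crossed a page boundary
            page = self.free.pop()             # allocate on demand
            self.refcount[page] = 1
            table.append(page)

    def fork(self, parent, child, shared_tokens):
        """Share the parent's full pages covering a common prompt prefix."""
        shared_pages = shared_tokens // PAGE_TOKENS
        self.tables[child] = list(self.tables[parent][:shared_pages])
        for page in self.tables[child]:
            self.refcount[page] += 1

cache = PagedKVCache(total_pages=64)
for pos in range(40):                          # 40-token prompt -> 3 pages
    cache.append_token("req0", pos)
cache.fork("req0", "req1", shared_tokens=32)   # share the 2 full prefix pages
print(len(cache.tables["req0"]), len(cache.tables["req1"]), len(cache.free))
# → 3 2 61
```

Because pages are allocated only as sequences actually grow, internal fragmentation is bounded by one page per sequence rather than by the worst-case context length.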
Quantization for Inference
Quantization reduces the numerical precision of model weights and activations, dramatically reducing memory footprint and increasing throughput on hardware with efficient integer arithmetic units. The key tradeoff is inference quality versus efficiency.
INT8 weight-only quantization is the most widely deployed technique. Weights are quantized to 8-bit integers while activations remain in FP16 or BF16. This halves the model's memory footprint with minimal quality degradation (typically less than 0.5% on standard benchmarks); methods such as SmoothQuant and LLM.int8() target this 8-bit regime.
INT4 quantization reduces model size to 25% of the FP16 baseline, with GPTQ and AWQ the leading post-training algorithms at this bit width. Quality degradation is more significant than at INT8 but often acceptable for non-critical applications. GGUF format with Q4_K_M quantization has become popular for local inference deployments due to its excellent quality-efficiency balance.
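To make the mechanics concrete, here is a minimal symmetric absmax INT8 weight quantizer in pure Python. Production quantizers like GPTQ and AWQ use per-group scales and calibration data rather than the single per-tensor scale shown here:

```python
# Symmetric absmax quantization: scale = max|w| / 127, q = round(w / scale).
# Round-trip error is bounded by scale / 2.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.8, -1.27, 0.053, 0.4]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 4))   # → [80, -127, 5, 40] 0.003
```

The weakness of per-tensor scaling is visible here: a single outlier weight (the -1.27) stretches the scale for every other value, which is exactly what per-group scales and outlier-aware methods mitigate.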
FP8 quantization, supported natively by H100 GPUs through the Transformer Engine, offers a sweet spot between INT8 and INT4 — significant throughput improvement over FP16/BF16 with minimal quality loss. As FP8 training and inference tooling matures through 2025, it is likely to become the default precision for production LLM serving on H100 hardware.
Speculative Decoding
Speculative decoding exploits the observation that LLM decode steps are memory-bandwidth-bound, not compute-bound — the GPU's compute units sit underutilized during single-token generation. The algorithm uses a small, fast draft model to generate multiple candidate tokens speculatively, then uses the full target model to verify or reject them in a single parallel forward pass.
When the draft model's predictions are accepted, multiple tokens are produced per target model forward pass, effectively increasing throughput without degrading output quality (acceptance is rejection-sampling-based, guaranteeing the output distribution matches the target model). The practical speedup depends on the acceptance rate, which varies by task — coding tasks tend to see 2-3x speedups while open-ended generation may see 1.3-1.5x.
Disaggregated Prefill and Decode
An emerging architectural pattern separates prefill and decode computation onto different GPU instances. Prefill instances are optimized for high-throughput parallel computation, while decode instances are optimized for low-latency KV cache reads. This disaggregation allows independent scaling of each phase based on workload characteristics and avoids the latency interference between prefill and decode when they compete for the same GPU resources.
Tensor Parallelism for Inference
For very large models (70B+ parameters) that cannot fit on a single GPU, tensor parallelism distributes the model across multiple GPUs. Common deployments shard a 70B model across 4 or 8 GPUs within a single NVLink-connected node. The communication overhead of tensor parallelism is manageable within a single NVSwitch domain but becomes prohibitive across InfiniBand — inference-specific tensor parallelism should generally be confined within a single server node.
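The core sharding pattern can be illustrated without any GPU: split a linear layer's weight matrix across workers along the output dimension, compute partial results locally, and combine them. Plain Python lists stand in for device tensors here — the partitioning scheme, not the arithmetic, is the point:

```python
# Sharding one linear layer y = W @ x: each "GPU" holds a slice of W's output
# rows, computes its partial y with no communication, and the slices are
# combined (an all-gather in a real deployment).

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shard_rows(W, n_shards):
    step = len(W) // n_shards
    return [W[i * step:(i + 1) * step] for i in range(n_shards)]

W = [[1, 0], [0, 1], [2, 2], [3, -1]]       # 4x2 weight matrix
x = [10, 5]

shards = shard_rows(W, n_shards=2)          # "GPU 0": rows 0-1, "GPU 1": rows 2-3
partials = [matvec(shard, x) for shard in shards]   # local compute only
y = [v for partial in partials for v in partial]    # combine results
print(y, y == matvec(W, x))                 # → [10, 5, 30, 25] True
```

Megatron-style tensor parallelism alternates this output-dimension split with an input-dimension split on the next layer so that each pair of linear layers needs only a single all-reduce — and it is that collective, repeated every layer of every decode step, that makes NVLink-class interconnect bandwidth essential.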
Observability and SLA Management
Production LLM serving requires comprehensive latency histograms (P50, P95, P99 TTFT and ITL), throughput metrics (tokens/second, requests/second), and GPU utilization tracking. Effective capacity planning requires modeling the relationship between concurrency, input length distribution, and output length distribution — these three factors jointly determine the operating point on the latency-throughput curve. Autoscaling policies should be based on queue depth and latency percentiles rather than simple GPU utilization, which is a poor proxy for serving health.
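As a small example, the tail percentiles above can be computed from raw TTFT samples with the standard library alone. Real deployments typically use histogram-based metric pipelines (Prometheus-style buckets) rather than retaining raw samples, and the lognormal workload below is a synthetic assumption for illustration:

```python
import random
import statistics

# Tail-latency percentiles from observed TTFT samples (milliseconds).
# statistics.quantiles with n=100 returns 99 cut points: index 49 is P50,
# index 94 is P95, index 98 is P99.

def latency_percentiles(samples_ms):
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

random.seed(0)
# Synthetic TTFT: mostly fast, with a heavy tail from long prompts.
ttft = [random.lognormvariate(mu=5.0, sigma=0.6) for _ in range(10_000)]
p = latency_percentiles(ttft)
print({k: round(v, 1) for k, v in p.items()})
```

The heavy tail is why percentile-based alerting matters: for this distribution P99 is roughly four times the median, so an SLA stated on mean latency would badly misrepresent what slow requests actually experience.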