A distributed ML training job running across 512 GPUs for several weeks represents an investment of hundreds of thousands of dollars in compute. When that job silently degrades — due to a GPU thermal throttle, a NIC that drops into a lower-bandwidth mode, a storage node experiencing elevated latency, or a subtle numerical instability causing loss divergence — the cost can be enormous: days of wasted compute and delayed model delivery. Comprehensive observability is not optional for serious ML infrastructure teams; it is a prerequisite for operating at scale with acceptable reliability.

Observability in distributed ML systems is substantially more complex than in traditional software services. A training job simultaneously produces infrastructure telemetry (GPU utilization, memory, temperature, network throughput), ML-specific signals (training loss, gradient norms, learning rate, validation metrics), and distributed systems signals (collective operation latency, rank synchronization barriers, checkpoint durations). Correlating these signals across hundreds of nodes to diagnose a performance anomaly requires purpose-built tooling and carefully designed instrumentation.

The Three Pillars: Metrics, Logs, and Traces

The observability engineering discipline distinguishes three complementary signal types: metrics (numerical time series), logs (structured event records), and distributed traces (request-scoped call graphs). Each serves a distinct diagnostic purpose in ML systems, and a mature observability platform needs all three.

Metrics are the foundation for real-time monitoring and alerting. In ML infrastructure, metrics fall into three categories: hardware metrics (GPU utilization, memory utilization, power consumption, temperature, NVLink/InfiniBand bandwidth and error counters), system metrics (CPU utilization, memory, disk I/O, network interface throughput), and ML-domain metrics (training loss, gradient L2 norm, tokens per second, samples per second, MFU — Model FLOP Utilization). Prometheus with DCGM Exporter (for GPU metrics) is the standard stack for hardware and system metrics. ML-domain metrics are typically exported by the training framework directly through integrations with experiment tracking platforms like Weights & Biases, MLflow, or custom Prometheus metric endpoints.
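As a concrete illustration of the custom Prometheus endpoint option for ML-domain metrics, the sketch below renders gauge values in the Prometheus text exposition format that a per-rank `/metrics` endpoint would serve. The metric names, label names, and job ID are illustrative assumptions, not a fixed convention; a real deployment would typically use the `prometheus_client` library to serve these over HTTP.

```python
# Minimal sketch: render ML-domain metrics in the Prometheus text exposition
# format. Metric and label names here are illustrative, not a standard.

def format_prometheus(metrics: dict, labels: dict) -> str:
    """Render a flat dict of gauge values as Prometheus exposition text."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

payload = format_prometheus(
    {"train_loss": 2.31, "tokens_per_second": 18500.0, "grad_l2_norm": 0.87},
    {"job_id": "llm-pretrain-042", "rank": 17},  # hypothetical label values
)
print(payload)
```

Keeping the exposition logic this simple makes it easy to scrape ML-domain metrics with the same Prometheus instance that already scrapes DCGM Exporter.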

Logs provide context that metrics cannot: the exact sequence of events leading to a failure, Python tracebacks, NCCL error messages, driver initialization details. Structured logging (JSON format with consistent fields: rank, job_id, timestamp, level, message) is essential for log aggregation and search at scale. A 512-GPU job generates potentially tens of thousands of log lines per second across all ranks. Centralized log aggregation (Loki, Elasticsearch, or cloud-native logging) with aggressive filtering at the agent level is necessary to manage volume without losing important diagnostic information.
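A minimal sketch of the structured-logging convention described above, using only the standard library. The field names (`rank`, `job_id`, `timestamp`, `level`, `message`) follow the text; everything else — the logger name, the example job ID — is an illustrative assumption.

```python
import json
import logging

# Sketch of structured JSON logging with consistent per-record fields.
# Field names follow the convention in the text; values are illustrative.

class JsonFormatter(logging.Formatter):
    def __init__(self, rank: int, job_id: str):
        super().__init__()
        self.rank, self.job_id = rank, job_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "rank": self.rank,
            "job_id": self.job_id,
            "message": record.getMessage(),
        })

logger = logging.getLogger("train")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(rank=17, job_id="llm-pretrain-042"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("checkpoint saved at step %d", 12000)
```

Because every record carries `rank` and `job_id`, a Loki or Elasticsearch query can isolate one rank's events out of tens of thousands of lines per second.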

Distributed traces track the execution of a single training step across all ranks, recording the duration of each phase (data loading, forward pass, backward pass, optimizer step, collective communication). Traces reveal which phase is the bottleneck and whether specific ranks are slower than others. PyTorch's torch.profiler generates detailed per-step traces in Chrome trace format or TensorBoard. At cluster scale, sampling traces (e.g., recording one detailed trace every 100 steps) provides statistical insight without the overhead of recording every step.
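The sampling policy described above can be sketched as a small framework-agnostic helper; in PyTorch this role is played by `torch.profiler` with a schedule. The `warmup` parameter (skipping early steps dominated by compilation and cache warming) is an assumption added here for realism.

```python
# Sketch of a trace-sampling policy: record one detailed profiler trace
# every `interval` steps, after an initial warmup period. In PyTorch the
# equivalent is a torch.profiler schedule; this helper is framework-agnostic.

def should_trace(step: int, interval: int = 100, warmup: int = 10) -> bool:
    """Return True on steps where a detailed profiler trace should run."""
    if step < warmup:  # skip startup noise (compilation, allocator warmup)
        return False
    return step % interval == 0

traced = [s for s in range(0, 301) if should_trace(s)]
print(traced)  # [100, 200, 300]
```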

GPU Hardware Metrics and DCGM

NVIDIA's Data Center GPU Manager (DCGM) is the standard tool for collecting GPU telemetry in data center environments. DCGM provides approximately 150 metrics across five categories: utilization (SM utilization, memory bandwidth utilization, FP16/BF16/FP32 FLOP rate), memory (HBM utilization, ECC error counts, PCIe bandwidth), thermal (GPU temperature, memory temperature, power consumption), NVLink (NVLink data transmitted/received per link, NVLink error counts), and health status (GPU health enumeration, remapped rows for HBM ECC).

The most operationally significant metrics for diagnosing distributed training performance degradation are: SM Utilization (should be sustained above 80% for compute-intensive training — drops indicate stalls waiting for data or communication), NVLink bandwidth utilization (should be near peak during all-reduce — drops indicate communication bottlenecks or NVLink errors), GPU temperature and power draw (thermal throttling reduces clock speeds and is silent without this monitoring), and ECC error counts (correctable errors are normal; uncorrectable errors indicate hardware failure requiring node evacuation).

DCGM Exporter integrates DCGM with Prometheus, exposing GPU metrics at configurable scrape intervals (typically 5-15 seconds for operational monitoring). Grafana dashboards built on DCGM metrics provide real-time visibility into cluster health and can drive alerts via Alertmanager. Useful alert conditions include: GPU temperature sustained above roughly 82°C (approaching the thermal-throttle range for H100-class GPUs), SM utilization below 60% for more than 60 seconds on an active training job (indicating a stall), and any uncorrectable ECC errors (hardware failure).
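The alert conditions above can be expressed as Prometheus alerting rules, sketched below. The DCGM metric names follow dcgm-exporter defaults but should be verified against your exporter version, and the thresholds are the illustrative values from the text, not universal constants.

```yaml
# Sketch of Prometheus alerting rules for the conditions described above.
# Verify metric names against your dcgm-exporter version before deploying.
groups:
  - name: gpu-health
    rules:
      - alert: GpuThermalThrottleRisk
        expr: DCGM_FI_DEV_GPU_TEMP > 82
        for: 30s
        labels: {severity: warning}
        annotations:
          summary: "GPU on {{ $labels.Hostname }} above 82C"
      - alert: GpuStalled
        expr: DCGM_FI_DEV_GPU_UTIL < 60
        for: 60s
        labels: {severity: warning}
        annotations:
          summary: "GPU utilization below 60% for 60s on an active job"
      - alert: GpuUncorrectableEcc
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[5m]) > 0
        labels: {severity: critical}
        annotations:
          summary: "Uncorrectable ECC errors detected -- evacuate node"
```

Routing the `critical` ECC alert to an on-call pager while keeping the `warning` alerts on a dashboard is a common Alertmanager configuration choice.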

Training Job Health Monitoring

Infrastructure metrics tell you whether the hardware is healthy, but they do not tell you whether the training run is making progress correctly. ML-domain metrics require separate instrumentation embedded in the training code. The key signals are training loss (the primary indicator of model quality progress), gradient L2 norm (large spikes indicate instability; sustained zero values indicate dead gradients or vanishing gradient problems), learning rate schedule (ensures the scheduler is advancing correctly), throughput (tokens/second or samples/second as a function of step number — drops indicate infrastructure degradation), and checkpoint duration and frequency (excessively long checkpoints can indicate storage issues or checkpoint compression problems).
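Two of those signals, the global gradient L2 norm and throughput, can be sketched in plain Python. In a real PyTorch loop the per-parameter values would come from `p.grad` (or from `torch.nn.utils.clip_grad_norm_`, which returns the total norm); plain lists stand in for gradient tensors here.

```python
import math

# Sketch of two ML-domain health signals: the global gradient L2 norm and
# training throughput. Plain lists stand in for per-parameter gradients.

def global_grad_l2_norm(grads: list) -> float:
    """L2 norm over all gradient values, across all parameter tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def throughput(tokens_in_step: int, step_seconds: float) -> float:
    """Tokens processed per second for one training step."""
    return tokens_in_step / step_seconds

print(global_grad_l2_norm([[3.0], [4.0]]))                   # 5.0
print(throughput(tokens_in_step=524288, step_seconds=2.0))   # 262144.0
```

Logging both values every step is cheap, and a sustained zero norm or a throughput drop is exactly the kind of signal the alerting described above should fire on.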

Loss divergence — where training loss suddenly increases rather than decreasing — is one of the most expensive failures in large-scale training. It can result from numerical overflow, a corrupted data batch, a learning rate that is too high, or infrastructure failures that corrupt gradients. Automated anomaly detection on the loss curve, alerting when the loss increases by more than a configurable threshold over a sliding window, is an essential safeguard. Some organizations implement automated run termination and rollback to the last good checkpoint when loss divergence is detected, avoiding hours of wasted compute on a diverged run.
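The sliding-window anomaly detection described above can be sketched as follows; the window size and threshold are illustrative and task-dependent, and a production version would trigger the rollback-to-checkpoint path rather than just reporting.

```python
from collections import deque

# Sketch of sliding-window loss-divergence detection: flag a step when the
# current loss exceeds the window minimum by a configurable threshold.
# Window size and threshold are illustrative, not recommended defaults.

class DivergenceDetector:
    def __init__(self, window: int = 100, threshold: float = 0.5):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, loss: float) -> bool:
        """Record a loss value; return True if divergence is detected."""
        diverged = (
            len(self.history) == self.history.maxlen
            and loss > min(self.history) + self.threshold
        )
        self.history.append(loss)
        return diverged

detector = DivergenceDetector(window=5, threshold=0.5)
for step, loss in enumerate([2.0, 1.9, 1.8, 1.8, 1.7, 3.0]):
    if detector.update(loss):
        print(f"divergence at step {step}: loss={loss}")  # fires at step 5
```

Comparing against the window minimum (rather than the most recent value) makes the detector robust to normal step-to-step loss noise.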

Weights & Biases (wandb) is the most widely adopted experiment tracking platform for ML teams, providing automatic logging of loss, metrics, system stats, and model artifacts with a rich web interface. For organizations requiring on-premises or private cloud deployments, MLflow and the open-source Aim are alternatives. The choice of experiment tracking platform should consider data residency requirements, team size, and integration with the training framework — PyTorch Lightning, Transformers, and DeepSpeed all have built-in wandb and MLflow integrations.

Distributed Training Failure Modes and Diagnosis

Distributed training failures fall into several categories that require different diagnostic approaches. Rank death — where one or more processes in the distributed job terminate unexpectedly — is detected by the remaining ranks when NCCL collective operations hang waiting for the dead rank to participate. Modern training frameworks implement heartbeat monitoring and watchdog timers to detect rank death within seconds rather than waiting for a full NCCL timeout (which defaults to 30 minutes). When a rank dies, the job typically needs to be restarted from the most recent checkpoint; automatic fault detection and restart is implemented in launchers like torchrun with elastic training support (torch.distributed.elastic).

Silent hardware degradation is more insidious than outright failures because the job continues running but at reduced efficiency. A single GPU throttling to 80% of its clock rate due to thermal issues will slow the entire data-parallel group to roughly 80% speed, since all ranks synchronize at each gradient all-reduce. NVLink or InfiniBand link errors that cause retransmissions degrade collective bandwidth without generating obvious failures. Detecting these requires monitoring per-GPU throughput metrics and comparing ranks — if one rank's step time is consistently higher than the others', it is likely a straggler with a hardware issue.
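The per-rank comparison described above can be sketched as a simple straggler finder: flag ranks whose median step time exceeds the cluster-wide median by a relative margin. The 5% margin is an illustrative assumption that should be tuned to the normal step-time jitter of the workload.

```python
from statistics import median

# Sketch of per-rank straggler detection: flag ranks whose median step time
# exceeds the cluster-wide median by a relative margin (5% is illustrative).

def find_stragglers(step_times: dict, margin: float = 0.05) -> list:
    """step_times maps rank -> list of recent per-step durations (seconds)."""
    per_rank = {rank: median(ts) for rank, ts in step_times.items()}
    cluster = median(per_rank.values())
    return sorted(r for r, m in per_rank.items() if m > cluster * (1 + margin))

times = {0: [1.00, 1.01], 1: [0.99, 1.00], 2: [1.25, 1.27], 3: [1.00, 1.02]}
print(find_stragglers(times))  # [2]
```

Using medians rather than means keeps a single slow step (e.g., one coinciding with a checkpoint) from falsely implicating a healthy rank.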

NCCL watchdogs can detect collective operation hangs and generate stack traces of all threads, helping identify whether the hang is waiting for a dead rank, experiencing a network timeout, or deadlocked in the application code. Setting NCCL_DEBUG=INFO captures NCCL's initialization and runtime logs, which are invaluable for diagnosing collective performance anomalies and connection failures. The combination of NCCL debug logs, DCGM metrics, and distributed traces from torch.profiler provides the information needed to diagnose most distributed training infrastructure failures.
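A typical diagnostic launch, sketched below, enables the NCCL debug logging described above. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are standard NCCL environment variables; the commented torchrun invocation and `train.py` entry point are illustrative assumptions.

```shell
# Diagnostic environment for NCCL issues (standard NCCL variables):
export NCCL_DEBUG=INFO                        # log init and runtime events
export NCCL_DEBUG_SUBSYS=INIT,NET             # optional: narrow to subsystems
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log    # %h = hostname, %p = pid
# Then launch as usual, e.g.:
#   torchrun --nproc_per_node=8 train.py
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

Writing per-process log files via `NCCL_DEBUG_FILE` keeps the debug output out of the job's stdout, where it would otherwise overwhelm the centralized log aggregation pipeline.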

Capacity and Cluster Utilization Monitoring

At the cluster level, monitoring shifts from per-job diagnostics to resource utilization and efficiency metrics. Key cluster-level signals include GPU-hours delivered vs scheduled (measuring scheduler efficiency), cluster-wide GPU utilization distribution (identifying nodes with chronically low utilization that may indicate hardware issues), job queue depth and wait time (measuring whether the cluster is supply-constrained), and energy efficiency (PUE, jobs/kWh).

Model FLOP Utilization (MFU) is an increasingly standard metric for measuring training efficiency, defined as the fraction of theoretical peak GPU FLOP throughput achieved by the training job. An MFU of 35-45% is typical for large-scale transformer training with current frameworks and hardware; values below 30% suggest significant inefficiencies in the training setup (excessive communication, data loading bottlenecks, or suboptimal kernel utilization). Tracking MFU across training runs and GPU generations provides an apples-to-apples comparison of infrastructure efficiency improvements over time.
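As a worked sketch of the MFU definition above: the common `6 * N * T` approximation for transformer training FLOPs (N = parameters, T = tokens processed, covering forward and backward passes) is an assumption adopted here, and the model size, throughput, and peak-FLOPs figures are illustrative; substitute your hardware's datasheet number for `peak_flops_per_gpu`.

```python
# Sketch of Model FLOP Utilization (MFU). Uses the standard 6*N*T
# approximation for transformer training FLOPs; all numbers below are
# illustrative assumptions, not measurements.

def mfu(params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Fraction of theoretical peak FLOP throughput achieved."""
    achieved = 6 * params * tokens_per_second  # FLOP/s for forward + backward
    return achieved / (num_gpus * peak_flops_per_gpu)

# Illustrative: 70B-param model, 512 GPUs at ~989 TFLOP/s BF16 dense peak.
value = mfu(params=70e9, tokens_per_second=5.0e5,
            num_gpus=512, peak_flops_per_gpu=989e12)
print(f"MFU = {value:.2%}")
```

With these example numbers the result lands in the 35-45% band the text cites as typical for large-scale transformer training.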

Key Takeaways

  • Comprehensive observability requires three complementary signal types: metrics (Prometheus/DCGM), structured logs (centralized aggregation), and distributed traces (torch.profiler).
  • DCGM provides the GPU hardware telemetry needed to detect thermal throttling, NVLink errors, ECC failures, and utilization drops — all of which can silently degrade training performance.
  • ML-domain metrics (loss, gradient norm, throughput, MFU) require explicit instrumentation in training code and are best managed through experiment tracking platforms like Weights & Biases or MLflow.
  • Loss divergence detection and automated checkpoint rollback are essential safeguards that prevent hours of compute waste when training runs become unstable.
  • Per-rank throughput comparison is the fastest way to identify stragglers caused by hardware degradation — a rank consistently slower than its peers almost always has an infrastructure issue.
  • Model FLOP Utilization (MFU) provides a normalized efficiency metric for comparing training configurations, hardware generations, and optimization improvements.

Conclusion

Building observability into distributed ML systems from day one — rather than instrumenting after problems occur — is one of the highest-return infrastructure investments an ML platform team can make. The combination of hardware telemetry, ML-domain metrics, and distributed traces, unified in a coherent alerting and dashboarding platform, transforms training operation from a reactive firefighting exercise into a proactive, data-driven discipline. As training runs grow longer and more expensive, the ability to detect and diagnose problems quickly is directly tied to compute efficiency and ultimately to research velocity.