🧠 Distributed Training and Performance
🎯 What You'll Learn
- Scaling basics Understand why and when to distribute training.
- Training patterns Data parallel, model parallel, and hybrid strategies.
- Framework tools Work with Horovod, PyTorch DDP, and TensorFlow strategies.
- Hardware options Choose GPUs, TPUs, and accelerators for different workloads.
- Performance tuning Optimize throughput, latency, and cost.
- Real-time inference Serve fast predictions with batching and caching.
📋 Overview
Modern ML models often exceed the limits of a single machine. Distributed training spreads the work across multiple devices to train faster or handle bigger datasets. This guide explains the main ideas, tools, and practical tips in plain language, and shows how to set up distributed jobs, avoid common errors, and measure results.
🧠 When to Scale Out
- Large datasets Training an epoch takes too long on one machine.
- Wide models Memory limits prevent batching or storing all parameters.
- Tight deadlines Teams need faster iteration cycles.
- Serving traffic spikes Real-time predictions require horizontal scaling.
- Multi-tenant platforms Shared infrastructure must handle many models at once.
🔀 Parallelism Patterns
- Data parallelism Split mini-batches across workers, then average gradients (see the sketch after this list).
- Model parallelism Split layers or parameters across devices for giant models.
- Pipeline parallelism Break model into stages and stream batches through.
- Parameter server Central nodes store parameters; workers compute gradients.
- Hybrid setups Combine data and model parallelism for flexibility.
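To make the first pattern concrete, here is a minimal sketch of the gradient-averaging step behind data parallelism, assuming a `torch.distributed` process group is already initialized. In practice DDP and Horovod do this for you, with tensor fusion and communication overlapped with backpropagation.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor over every worker, then divide so each
            # worker applies the same averaged update and stays in sync.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

The sketch also shows why data parallelism leans on fast interconnects: every step moves a full copy of the gradients between workers.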
🛠️ Framework Overview
| Tool | Highlights |
|---|---|
| Horovod | Unified API for TensorFlow, PyTorch, MXNet. Uses ring-allreduce. |
| PyTorch DDP | Native module for multi-GPU and multi-node training. |
| TensorFlow MirroredStrategy | Easy data parallelism on a single machine. |
| TensorFlow MultiWorkerMirroredStrategy | Multi-node synchronous training. |
| DeepSpeed | Optimizations for large, sparse, or transformer models. |
| Megatron-LM | NVIDIA toolkit for large language models. |
Selecting a Tool
- Start with built-in options (PyTorch DDP, TF MirroredStrategy).
- Use Horovod when you need framework flexibility or MPI clusters.
- Adopt DeepSpeed or Megatron-LM for very large transformer workloads.
📦 Environment Setup
- Container images Package dependencies with CUDA/cuDNN, NCCL, and MPI (a pre-flight check is sketched after this list).
- Cluster managers Kubernetes, AWS SageMaker, Azure ML, or on-prem Slurm.
- Networking Ensure high bandwidth (InfiniBand or 100GbE) for gradient sync.
- Storage Stage datasets on shared file systems or object storage with caching.
- Secrets Store access keys in Vault, AWS Secrets Manager, or Kubernetes secrets.
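Before launching a multi-node job, a quick pre-flight check inside the training container on each node catches many setup problems early. This is a minimal sketch assuming PyTorch with CUDA; the environment variables shown are the ones `torchrun` and most launchers populate.

```python
import os
import torch

# Pre-flight check: run once per node inside the training container.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs  :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("NCCL version  :", torch.cuda.nccl.version())
for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var:11}:", os.environ.get(var, "<unset>"))
```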
🧪 PyTorch DDP Example
The example below is self-contained; the toy dataset and model stand in for a real data pipeline and architecture.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

def train():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Toy dataset and model; replace with your own data pipeline and architecture.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, pin_memory=True)
    ddp_model = DDP(torch.nn.Linear(32, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(ddp_model(x), y)
        loss.backward()       # DDP averages gradients across workers during backward
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    setup()
    train()
    cleanup()
```
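Launch the script with `torchrun`, for example `torchrun --standalone --nproc_per_node=4 train.py` on a single four-GPU machine, or with `--nnodes` and a rendezvous endpoint for multi-node runs; `torchrun` starts one process per GPU and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variables the code reads.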
🚀 Horovod Quick Start
- Install with PyTorch support: `HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]` (add `HOROVOD_GPU_OPERATIONS=NCCL` to build NCCL-backed GPU allreduce).
- Launch training: `horovodrun -np 8 python train.py`.
- Horovod wraps your optimizer to average gradients across workers; scale the learning rate yourself, commonly by the number of workers, as in the sketch below.
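This sketch shows the typical PyTorch + Horovod pattern; the toy dataset, model, and learning rate are placeholder values, while the Horovod calls (`hvd.init`, `hvd.DistributedOptimizer`, the broadcasts) are the core of most Horovod training scripts.

```python
import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

hvd.init()                                   # one process per GPU, started by horovodrun
torch.cuda.set_device(hvd.local_rank())

# Toy data and model; replace with your own.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
model = torch.nn.Linear(32, 2).cuda()
loss_fn = torch.nn.CrossEntropyLoss()

# Scale the learning rate by the number of workers, then wrap the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()                          # allreduce runs inside the wrapped optimizer
    optimizer.step()
```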
📏 Scaling Rules of Thumb
- Linear scaling Double the workers and the global batch size together, and scale the learning rate proportionally.
- Warmup Use learning rate warmup to stabilize large-batch training (see the sketch after this list).
- Gradient clipping Prevent exploding gradients when syncing many workers.
- Checkpointing Save state regularly to recover from node failures.
- Fault tolerance Use elastic training APIs to handle lost workers.
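The linear-scaling and warmup rules translate into a small learning-rate schedule. This is a hedged sketch using PyTorch's `LambdaLR`; the base rate, worker count, and warmup length are invented example values.

```python
import torch

def scaled_lr_with_warmup(base_lr: float, world_size: int, warmup_steps: int,
                          optimizer: torch.optim.Optimizer):
    """Linear-scaling rule: target LR = base LR * number of workers,
    reached by ramping up linearly over the first warmup_steps steps."""
    for group in optimizer.param_groups:
        group["lr"] = base_lr * world_size
    def warmup(step: int) -> float:
        # Multiplier grows from near 0 to 1.0 during warmup, then stays at 1.0.
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

# Example usage with made-up numbers: 8 workers, 500 warmup steps.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = scaled_lr_with_warmup(base_lr=0.1, world_size=8, warmup_steps=500, optimizer=opt)
# Inside the training loop, call opt.step() and then scheduler.step().
```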
⚡ Performance Tuning Checklist
- Profile I/O Ensure GPUs stay busy by prefetching data.
- Mixed precision Use FP16 or BF16 to speed up math and reduce memory (combined with gradient accumulation in the sketch after this checklist).
- Gradient accumulation Simulate large batches when memory is limited.
- Pinned memory Enable pinned host memory for faster CPU-GPU transfers.
- CUDA streams Overlap computation and data transfers.
- Tensor fusion Combine small tensors before allreduce to reduce overhead.
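Mixed precision and gradient accumulation combine naturally, as in this minimal single-GPU sketch with `torch.cuda.amp`; the model, random data, and accumulation factor are placeholders.

```python
import torch

model = torch.nn.Linear(32, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()         # scales the loss to avoid FP16 underflow
accum_steps = 4                              # simulate a 4x larger batch

optimizer.zero_grad()
for step in range(100):                      # stand-in for iterating over a DataLoader
    x = torch.randn(16, 32, device="cuda")
    y = torch.randint(0, 2, (16,), device="cuda")
    with torch.cuda.amp.autocast():          # run the forward pass in reduced precision
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)               # unscale gradients and apply the update
        scaler.update()
        optimizer.zero_grad()
```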
🧮 Metrics to Track
- Throughput Samples per second per worker.
- Speedup Compare runtime vs single GPU baseline.
- Scaling efficiency Speedup divided by number of workers (see the helper after this list).
- GPU utilization Monitor with `nvidia-smi` or Prometheus exporters.
- Network bandwidth Watch for bottlenecks on the interconnect.
- Loss curves Ensure convergence does not degrade at scale.
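Speedup and scaling efficiency are simple ratios. The helper below shows the arithmetic; the throughput numbers in the example are invented.

```python
def speedup(multi_gpu_throughput: float, single_gpu_throughput: float) -> float:
    """Speedup = multi-GPU throughput / single-GPU baseline throughput."""
    return multi_gpu_throughput / single_gpu_throughput

def scaling_efficiency(multi_gpu_throughput: float, single_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Scaling efficiency = speedup / number of GPUs (1.0 is perfect linear scaling)."""
    return speedup(multi_gpu_throughput, single_gpu_throughput) / num_gpus

# Invented example: 1 GPU does 220 samples/s, 8 GPUs do 1500 samples/s.
print(speedup(1500, 220))                # ~6.8x
print(scaling_efficiency(1500, 220, 8))  # ~0.85, i.e. 85% efficiency
```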
🛰️ Hardware Primer
- GPUs Most common accelerators; choose memory size based on model.
- TPUs Google Cloud offering with strong performance for TensorFlow/JAX.
- AWS Trainium Custom chips for high throughput training.
- FPGAs Flexible but require specialized development.
- Multi-GPU boxes NVLink or NVSwitch machines for strong intra-node bandwidth.
🧾 Kubernetes Patterns
- Operators Kubeflow Training Operator, Volcano, or Ray for distributed jobs.
- CRDs Submit `PyTorchJob` or `TFJob` manifests describing workers and masters.
- Storage mounts Use PVCs or object storage gateways for datasets.
- Autoscaling Scale pods based on queue length or custom metrics.
- Spot instances Save cost with interruption-tolerant setups and checkpoints.
📄 Example PyTorchJob

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-pretraining
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: registry/bert:latest
              args: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: registry/bert:latest
              args: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```
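Assuming the Kubeflow Training Operator is installed in the cluster, submit the manifest with `kubectl apply -f pytorchjob.yaml` and watch it with `kubectl get pytorchjobs` and `kubectl logs`; the image name and replica counts above are placeholders to adapt.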
🧰 Real-Time Inference Strategies
- Batching Group requests to reuse GPU kernels (a toy micro-batcher is sketched after this list).
- Autoscaling Scale pods based on CPU/GPU metrics or queue length.
- Model caching Keep hot models in memory; lazy load rarely used ones.
- Compiled runtimes Use TensorRT, ONNX Runtime, or TorchScript for speed.
- Edge serving Deploy lightweight models closer to users.
- Canary releases Gradually route traffic to new versions.
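Dedicated servers such as Triton or TorchServe provide dynamic batching out of the box; the sketch below only illustrates the idea with a hypothetical in-process micro-batcher that collects requests for a short window and runs one batched forward pass.

```python
import queue
import threading
import torch

# Hypothetical in-process micro-batcher: gather requests for up to `max_batch`
# items or `timeout` seconds, then run a single batched forward pass.
model = torch.nn.Linear(32, 2).eval()        # stand-in for a real model
requests = queue.Queue()                     # holds (input_tensor, reply_queue) pairs

def batching_worker(max_batch: int = 8, timeout: float = 0.01) -> None:
    while True:
        x, reply = requests.get()            # block until at least one request arrives
        batch, replies = [x], [reply]
        while len(batch) < max_batch:
            try:
                x, reply = requests.get(timeout=timeout)
                batch.append(x)
                replies.append(reply)
            except queue.Empty:
                break                        # window closed: serve what we have
        with torch.no_grad():
            outputs = model(torch.stack(batch))
        for out, reply_q in zip(outputs, replies):
            reply_q.put(out)

threading.Thread(target=batching_worker, daemon=True).start()

def predict(x: torch.Tensor) -> torch.Tensor:
    reply = queue.Queue(maxsize=1)
    requests.put((x, reply))
    return reply.get()                       # wait for the batched result

print(predict(torch.randn(32)).shape)        # torch.Size([2])
```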
🔧 Optimization Techniques
- Quantization Reduce precision (e.g., INT8) to shrink models and speed up inference (see the sketch after this list).
- Pruning Remove weights with minimal impact on accuracy.
- Knowledge distillation Train smaller students from larger teachers.
- Operator fusion Combine operations to reduce overhead.
- Caching features Store expensive features in Redis for reuse.
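Here is a minimal dynamic-quantization sketch in PyTorch; the toy model is a placeholder, and a real deployment would validate accuracy before and after converting.

```python
import torch

# Dynamic quantization: linear-layer weights are stored in INT8 and run with
# quantized kernels, while activations stay in floating point.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)    # same interface, smaller and faster linear layers
```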
🔍 Troubleshooting Guide
- Diverging loss Check learning rate, gradient clipping, and data distribution.
- Hanging job Inspect networking, NCCL settings, and GPU driver versions.
- OOM errors Reduce batch size, enable gradient checkpointing, or shard model.
- Slow allreduce Upgrade interconnect, tweak NCCL algorithms, balance workloads.
- Unbalanced workers Ensure each worker reads unique data shards.
- Inconsistent results Set seeds, synchronize random states, and log versions (a seed helper is sketched after this list).
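For the last item, a small seed-setting helper goes a long way. This is a best-effort sketch; full determinism also depends on library versions, kernel choices, and data ordering.

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Best-effort reproducibility across Python, NumPy, and PyTorch/CUDA."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable per-run autotuning
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
print("torch", torch.__version__)               # log versions alongside the seed
```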
🧮 Cost Management
- Spot/Preemptible instances Cheap but require checkpoint resilience.
- Right sizing Match instance types to model memory and compute needs.
- Schedule windows Train during off-peak hours for lower cloud rates.
- Multi-cluster strategy Use on-prem for steady workloads, cloud for bursts.
- Utilization tracking Alert when GPUs sit idle in clusters.
🧪 Benchmarking Workflow
- Measure a single-GPU baseline with sample data (a tiny throughput probe is sketched after this list).
- Scale to 2, 4, 8 GPUs and record throughput and accuracy.
- Tune batch size and learning rate per step.
- Profile data pipeline and fix bottlenecks.
- Document results, configs, and environment details in version control.
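A tiny throughput probe like the one below is enough for the baseline step; the model, batch size, and step count are invented, and on GPU you would also call `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch

# Hypothetical throughput probe: time a fixed number of training steps and
# report samples per second.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size, steps = 64, 50

x = torch.randn(batch_size, 512)
start = time.perf_counter()
for _ in range(steps):
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
elapsed = time.perf_counter() - start
print(f"{batch_size * steps / elapsed:.1f} samples/sec")
```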
📘 Example Scenario
A language team pre-trains a transformer on billions of tokens. They use PyTorch DDP across 16 GPUs on two Kubernetes nodes. Data shards live in an object store and stream through a caching layer. Mixed precision roughly doubles throughput. Checkpoints are written every hour to protect against spot instance interruptions. After training, the team exports the model to ONNX Runtime for low-latency inference with Triton. Dashboards track GPU utilization, loss curves, and cost per step.
💡 Best Practices
- ✅ Automate setup Use scripts or Helm charts to launch distributed jobs.
- ✅ Monitor closely Collect metrics from GPUs, CPUs, and network devices.
- ✅ Document configs Keep YAML manifests and parameter files in Git.
- ✅ Reproduce runs Record random seeds, library versions, and data hashes.
- ✅ Security first Restrict cluster access and rotate credentials.
- ✅ Continuous learning Review the latest accelerator features and libraries.
⚠️ Common Pitfalls
- 🚫 Ignoring I/O Fast GPUs starve without matching data pipelines.
- 🚫 Manual scaling Hard-coded world sizes break when hardware changes.
- 🚫 No fault plan Jobs restart from scratch after a single worker failure.
- 🚫 Hidden costs Leaving large GPU clusters running after jobs finish.
- 🚫 Lack of testing Deploying new distributed code without small-scale checks.
🧩 Related Topics
- Previous Topic `12_Data_Engineering_and_Big_Data.md`
- Next Topic `14_Soft_Skills_and_Collaboration.md`
🧭 Quick Recap
| Step | Purpose |
|---|---|
| Choose pattern | Match data, model, and resource needs. |
| Set up cluster | Configure containers, drivers, and networking. |
| Tune performance | Keep hardware busy and models stable. |
| Monitor & cost | Track metrics, handle incidents, optimize spend. |
| Serve fast | Deploy optimized models for real-time workloads. |
🖼️ Assets
- Diagram Training cluster showing workers, parameter sync, and storage.
- Checklist Pre-flight list before launching multi-node jobs.
📚 References
- Official docs https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
- Official docs https://www.tensorflow.org/guide/distributed_training
- Official docs https://horovod.ai/docs/
- Blog posts https://developer.nvidia.com/blog
- Blog posts https://aws.amazon.com/blogs/machine-learning/
- GitHub examples https://github.com/pytorch/examples/tree/main/distributed
- GitHub examples https://github.com/horovod/horovod