🧠 Distributed Training and Performance
🎯 What You'll Learn
- Scaling basics Understand why and when to distribute training.
- Training patterns Data parallel, model parallel, and hybrid strategies.
- Framework tools Work with Horovod, PyTorch DDP, and TensorFlow strategies.
- Hardware options Choose GPUs, TPUs, and accelerators for different workloads.
- Performance tuning Optimize throughput, latency, and cost.
- Real-time inference Serve fast predictions with batching and caching.
📋 Overview
Modern ML models often exceed the limits of a single machine. Distributed training spreads the work across multiple devices to train faster or handle bigger datasets. This guide explains the main ideas, tools, and practical tips in plain language, and shows how to set up distributed jobs, avoid common errors, and measure results.
🧠 When to Scale Out
- Large datasets Training an epoch takes too long on one machine.
- Wide models Memory limits prevent batching or storing all parameters.
- Tight deadlines Teams need faster iteration cycles.
- Serving traffic spikes Real-time predictions require horizontal scaling.
- Multi-tenant platforms Shared infrastructure must handle many models at once.
🔀 Parallelism Patterns
- Data parallelism Split mini-batches across workers, then average gradients (see the sketch after this list).
- Model parallelism Split layers or parameters across devices for giant models.
- Pipeline parallelism Break model into stages and stream batches through.
- Parameter server Central nodes store parameters; workers compute gradients.
- Hybrid setups Combine data and model parallelism for flexibility.
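To make the first pattern concrete, here is a minimal sketch of the gradient-averaging step behind data parallelism, assuming a `torch.distributed` process group is already initialized. In practice DDP and Horovod do this for you, with tensor fusion and communication overlapped with backpropagation.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor over every worker, then divide so each
            # worker applies the same averaged update and stays in sync.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

The sketch also shows why data parallelism leans on fast interconnects: every step moves a full copy of the gradients between workers.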
🛠️ Framework Overview
| Tool | Highlights |
|---|---|
| Horovod | Unified API for TensorFlow, PyTorch, MXNet. Uses ring-allreduce. |
| PyTorch DDP | Native module for multi-GPU and multi-node training. |
| TensorFlow MirroredStrategy | Easy data parallelism on a single machine. |
| TensorFlow MultiWorkerMirroredStrategy | Multi-node synchronous training. |
| DeepSpeed | Optimizations for large, sparse, or transformer models. |
| Megatron-LM | NVIDIA toolkit for large language models. |
Selecting a Tool
- Start with built-in options (PyTorch DDP, TF MirroredStrategy).
- Use Horovod when you need framework flexibility or MPI clusters.
- Adopt DeepSpeed or Megatron-LM for very large transformer workloads.
📦 Environment Setup
- Container images Package dependencies with CUDA/cuDNN, NCCL, and MPI (a pre-flight check is sketched after this list).
- Cluster managers Kubernetes, AWS SageMaker, Azure ML, or on-prem Slurm.
- Networking Ensure high bandwidth (InfiniBand or 100GbE) for gradient sync.
- Storage Stage datasets on shared file systems or object storage with caching.
- Secrets Store access keys in Vault, AWS Secrets Manager, or Kubernetes secrets.
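Before launching a multi-node job, a quick pre-flight check inside the training container on each node catches many setup problems early. This is a minimal sketch assuming PyTorch with CUDA; the environment variables shown are the ones `torchrun` and most launchers populate.

```python
import os
import torch

# Pre-flight check: run once per node inside the training container.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs  :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("NCCL version  :", torch.cuda.nccl.version())
for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var:11}:", os.environ.get(var, "<unset>"))
```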
🧪 PyTorch DDP Example
The example below is self-contained; the toy dataset and model stand in for a real data pipeline and architecture.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

def train():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Toy dataset and model; replace with your own data pipeline and architecture.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, pin_memory=True)
    ddp_model = DDP(torch.nn.Linear(32, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = loss_fn(ddp_model(x), y)
        loss.backward()       # DDP averages gradients across workers during backward
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    setup()
    train()
    cleanup()
```
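Launch the script with `torchrun`, for example `torchrun --standalone --nproc_per_node=4 train.py` on a single four-GPU machine, or with `--nnodes` and a rendezvous endpoint for multi-node runs; `torchrun` starts one process per GPU and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variables the code reads.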
🚀 Horovod Quick Start
- Install with PyTorch support: `HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]` (add `HOROVOD_GPU_OPERATIONS=NCCL` to build NCCL-backed GPU allreduce).
- Launch training: `horovodrun -np 8 python train.py`.
- Horovod wraps your optimizer to average gradients across workers; scale the learning rate yourself, commonly by the number of workers, as in the sketch below.
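This sketch shows the typical PyTorch + Horovod pattern; the toy dataset, model, and learning rate are placeholder values, while the Horovod calls (`hvd.init`, `hvd.DistributedOptimizer`, the broadcasts) are the core of most Horovod training scripts.

```python
import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

hvd.init()                                   # one process per GPU, started by horovodrun
torch.cuda.set_device(hvd.local_rank())

# Toy data and model; replace with your own.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
model = torch.nn.Linear(32, 2).cuda()
loss_fn = torch.nn.CrossEntropyLoss()

# Scale the learning rate by the number of workers, then wrap the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()                          # allreduce runs inside the wrapped optimizer
    optimizer.step()
```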
📏 Scaling Rules of Thumb
- Linear scaling Double the workers and the global batch size together, and scale the learning rate proportionally.
- Warmup Use learning rate warmup to stabilize large-batch training (see the sketch after this list).
- Gradient clipping Prevent exploding gradients when syncing many workers.
- Checkpointing Save state regularly to recover from node failures.
- Fault tolerance Use elastic training APIs to handle lost workers.
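The linear-scaling and warmup rules translate into a small learning-rate schedule. This is a hedged sketch using PyTorch's `LambdaLR`; the base rate, worker count, and warmup length are invented example values.

```python
import torch

def scaled_lr_with_warmup(base_lr: float, world_size: int, warmup_steps: int,
                          optimizer: torch.optim.Optimizer):
    """Linear-scaling rule: target LR = base LR * number of workers,
    reached by ramping up linearly over the first warmup_steps steps."""
    for group in optimizer.param_groups:
        group["lr"] = base_lr * world_size
    def warmup(step: int) -> float:
        # Multiplier grows from near 0 to 1.0 during warmup, then stays at 1.0.
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

# Example usage with made-up numbers: 8 workers, 500 warmup steps.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = scaled_lr_with_warmup(base_lr=0.1, world_size=8, warmup_steps=500, optimizer=opt)
# Inside the training loop, call opt.step() and then scheduler.step().
```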
⚡ Performance Tuning Checklist
- Profile I/O Ensure GPUs stay busy by prefetching data.
- Mixed precision Use FP16 or BF16 to speed up math and reduce memory (combined with gradient accumulation in the sketch after this checklist).
- Gradient accumulation Simulate large batches when memory is limited.
- Pinned memory Enable pinned host memory for faster CPU-GPU transfers.
- CUDA streams Overlap computation and data transfers.
- Tensor fusion Combine small tensors before allreduce to reduce overhead.
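Mixed precision and gradient accumulation combine naturally, as in this minimal single-GPU sketch with `torch.cuda.amp`; the model, random data, and accumulation factor are placeholders.

```python
import torch

model = torch.nn.Linear(32, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()         # scales the loss to avoid FP16 underflow
accum_steps = 4                              # simulate a 4x larger batch

optimizer.zero_grad()
for step in range(100):                      # stand-in for iterating over a DataLoader
    x = torch.randn(16, 32, device="cuda")
    y = torch.randint(0, 2, (16,), device="cuda")
    with torch.cuda.amp.autocast():          # run the forward pass in reduced precision
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)               # unscale gradients and apply the update
        scaler.update()
        optimizer.zero_grad()
```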
🧮 Metrics to Track
- Throughput Samples per second per worker.
- Speedup Compare runtime vs single GPU baseline.
- Scaling efficiency Speedup divided by number of workers (see the helper after this list).
- GPU utilization Monitor with `nvidia-smi` or Prometheus exporters.
- Network bandwidth Watch for bottlenecks on the interconnect.
- Loss curves Ensure convergence does not degrade at scale.
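Speedup and scaling efficiency are simple ratios. The helper below shows the arithmetic; the throughput numbers in the example are invented.

```python
def speedup(multi_gpu_throughput: float, single_gpu_throughput: float) -> float:
    """Speedup = multi-GPU throughput / single-GPU baseline throughput."""
    return multi_gpu_throughput / single_gpu_throughput

def scaling_efficiency(multi_gpu_throughput: float, single_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Scaling efficiency = speedup / number of GPUs (1.0 is perfect linear scaling)."""
    return speedup(multi_gpu_throughput, single_gpu_throughput) / num_gpus

# Invented example: 1 GPU does 220 samples/s, 8 GPUs do 1500 samples/s.
print(speedup(1500, 220))                # ~6.8x
print(scaling_efficiency(1500, 220, 8))  # ~0.85, i.e. 85% efficiency
```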
🛰️ Hardware Primer
- GPUs Most common accelerators; choose memory size based on model.
- TPUs Google Cloud offering with strong performance for TensorFlow/JAX.
- AWS Trainium Custom chips for high throughput training.
- FPGAs Flexible but require specialized development.
- Multi-GPU boxes NVLink or NVSwitch machines for strong intra-node bandwidth.
🧾 Kubernetes Patterns
- Operators Kubeflow Training Operator, Volcano, or Ray for distributed jobs.
- CRDs Submit `PyTorchJob` or `TFJob` manifests describing workers and masters.
- Storage mounts Use PVCs or object storage gateways for datasets.
- Autoscaling Scale pods based on queue length or custom metrics.
- Spot instances Save cost with interruption-tolerant setups and checkpoints.
📄 Example PyTorchJob

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-pretraining
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: registry/bert:latest
              args: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: registry/bert:latest
              args: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```
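Assuming the Kubeflow Training Operator is installed in the cluster, submit the manifest with `kubectl apply -f pytorchjob.yaml` and watch it with `kubectl get pytorchjobs` and `kubectl logs`; the image name and replica counts above are placeholders to adapt.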
🧰 Real-Time Inference Strategies
- Batching Group requests to reuse GPU kernels (a toy micro-batcher is sketched after this list).
- Autoscaling Scale pods based on CPU/GPU metrics or queue length.
- Model caching Keep hot models in memory; lazy load rarely used ones.
- Compiled runtimes Use TensorRT, ONNX Runtime, or TorchScript for speed.
- Edge serving Deploy lightweight models closer to users.
- Canary releases Gradually route traffic to new versions.
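Dedicated servers such as Triton or TorchServe provide dynamic batching out of the box; the sketch below only illustrates the idea with a hypothetical in-process micro-batcher that collects requests for a short window and runs one batched forward pass.

```python
import queue
import threading
import torch

# Hypothetical in-process micro-batcher: gather requests for up to `max_batch`
# items or `timeout` seconds, then run a single batched forward pass.
model = torch.nn.Linear(32, 2).eval()        # stand-in for a real model
requests = queue.Queue()                     # holds (input_tensor, reply_queue) pairs

def batching_worker(max_batch: int = 8, timeout: float = 0.01) -> None:
    while True:
        x, reply = requests.get()            # block until at least one request arrives
        batch, replies = [x], [reply]
        while len(batch) < max_batch:
            try:
                x, reply = requests.get(timeout=timeout)
                batch.append(x)
                replies.append(reply)
            except queue.Empty:
                break                        # window closed: serve what we have
        with torch.no_grad():
            outputs = model(torch.stack(batch))
        for out, reply_q in zip(outputs, replies):
            reply_q.put(out)

threading.Thread(target=batching_worker, daemon=True).start()

def predict(x: torch.Tensor) -> torch.Tensor:
    reply = queue.Queue(maxsize=1)
    requests.put((x, reply))
    return reply.get()                       # wait for the batched result

print(predict(torch.randn(32)).shape)        # torch.Size([2])
```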
🔧 Optimization Techniques
- Quantization Reduce precision (e.g., INT8) to shrink models and speed up inference (see the sketch after this list).
- Pruning Remove weights with minimal impact on accuracy.
- Knowledge distillation Train smaller students from larger teachers.
- Operator fusion Combine operations to reduce overhead.
- Caching features Store expensive features in Redis for reuse.
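Here is a minimal dynamic-quantization sketch in PyTorch; the toy model is a placeholder, and a real deployment would validate accuracy before and after converting.

```python
import torch

# Dynamic quantization: linear-layer weights are stored in INT8 and run with
# quantized kernels, while activations stay in floating point.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)    # same interface, smaller and faster linear layers
```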
🔍 Troubleshooting Guide
- Diverging loss Check learning rate, gradient clipping, and data distribution.
- Hanging job Inspect networking, NCCL settings, and GPU driver versions.
- OOM errors Reduce batch size, enable gradient checkpointing, or shard model.
- Slow allreduce Upgrade interconnect, tweak NCCL algorithms, balance workloads.
- Unbalanced workers Ensure each worker reads unique data shards.
- Inconsistent results Set seeds, synchronize random states, and log versions (a seed helper is sketched after this list).
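For the last item, a small seed-setting helper goes a long way. This is a best-effort sketch; full determinism also depends on library versions, kernel choices, and data ordering.

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Best-effort reproducibility across Python, NumPy, and PyTorch/CUDA."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable per-run autotuning
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
print("torch", torch.__version__)               # log versions alongside the seed
```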
🧮 Cost Management
- Spot/Preemptible instances Cheap but require checkpoint resilience.
- Right sizing Match instance types to model memory and compute needs.
- Schedule windows Train during off-peak hours for lower cloud rates.
- Multi-cluster strategy Use on-prem for steady workloads, cloud for bursts.
- Utilization tracking Alert when GPUs sit idle in clusters.
🧪 Benchmarking Workflow
- Measure a single-GPU baseline with sample data (a tiny throughput probe is sketched after this list).
- Scale to 2, 4, 8 GPUs and record throughput and accuracy.
- Tune batch size and learning rate per step.
- Profile data pipeline and fix bottlenecks.
- Document results, configs, and environment details in version control.
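A tiny throughput probe like the one below is enough for the baseline step; the model, batch size, and step count are invented, and on GPU you would also call `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch

# Hypothetical throughput probe: time a fixed number of training steps and
# report samples per second.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size, steps = 64, 50

x = torch.randn(batch_size, 512)
start = time.perf_counter()
for _ in range(steps):
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
elapsed = time.perf_counter() - start
print(f"{batch_size * steps / elapsed:.1f} samples/sec")
```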
📘 Example Scenario
A language team pre-trains a transformer on billions of tokens. They use PyTorch DDP across 16 GPUs on two Kubernetes nodes. Data shards live in an object store and stream through a caching layer. Mixed precision roughly doubles throughput. Checkpoints are written every hour to protect against spot instance interruptions. After training, the team exports the model to ONNX Runtime for low-latency inference with Triton. Dashboards track GPU utilization, loss curves, and cost per step.
💡 Best Practices
- ✅ Automate setup Use scripts or Helm charts to launch distributed jobs.
- ✅ Monitor closely Collect metrics from GPUs, CPUs, and network devices.
- ✅ Document configs Keep YAML manifests and parameter files in Git.
- ✅ Reproduce runs Record random seeds, library versions, and data hashes.
- ✅ Security first Restrict cluster access and rotate credentials.
- ✅ Continuous learning Review the latest accelerator features and libraries.
⚠️ Common Pitfalls
- 🚫 Ignoring I/O Fast GPUs starve without matching data pipelines.
- 🚫 Manual scaling Hard-coded world sizes break when hardware changes.
- 🚫 No fault plan Jobs restart from scratch after a single worker failure.
- 🚫 Hidden costs Leaving large GPU clusters running after jobs finish.
- 🚫 Lack of testing Deploying new distributed code without small-scale checks.
🧩 Related Topics
- Previous Topic `12_Data_Engineering_and_Big_Data.md`
- Next Topic `14_Soft_Skills_and_Collaboration.md`
🧭 Quick Recap
| Step | Purpose |
|---|---|
| Choose pattern | Match data, model, and resource needs. |
| Set up cluster | Configure containers, drivers, and networking. |
| Tune performance | Keep hardware busy and models stable. |
| Monitor & cost | Track metrics, handle incidents, optimize spend. |
| Serve fast | Deploy optimized models for real-time workloads. |
🖼️ Assets
- Diagram Training cluster showing workers, parameter sync, and storage.
- Checklist Pre-flight list before launching multi-node jobs.
📚 References
- Official docs https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
- Official docs https://www.tensorflow.org/guide/distributed_training
- Official docs https://horovod.ai/docs/
- Blog posts https://developer.nvidia.com/blog
- Blog posts https://aws.amazon.com/blogs/machine-learning/
- GitHub examples https://github.com/pytorch/examples/tree/main/distributed
- GitHub examples https://github.com/horovod/horovod