06 Model Deployment

🧠 Model Deployment

🎯 What You’ll Learn

  • Deployment options: Compare batch, online, and streaming deployments.
      • Understand when batch scoring and asynchronous processing reduce infrastructure costs.
      • Evaluate latency-sensitive use cases that need online serving and low tail latencies.
      • Integrate streaming deployments for real-time feature updates and event-driven inference.
  • Serving stacks: Combine FastAPI, TorchServe, and Docker efficiently.
      • Select frameworks suited to your model format (ONNX, PyTorch, TensorFlow).
      • Compose microservices that wrap prediction logic with logging and validation.
      • Optimize container images for fast startup and minimal footprint.
  • Operationalization: Package models with dependencies and configs.
      • Convert notebooks into production-ready apps with reproducible environments.
      • Externalize parameters, secrets, and feature contracts.
      • Automate promotion from staging to prod with approvals.
  • Scalability: Use Kubernetes and autoscaling for reliable inference.
      • Configure horizontal pod autoscalers (HPA) and Knative for dynamic scaling.
      • Employ GPU scheduling and model sharding for heavy workloads.
      • Plan for edge deployments when latency and connectivity require local inference.

📖 Overview

Model deployment turns trained artifacts into live services that power applications. It requires packaging models with feature transformations, exposing APIs, and managing infrastructure. Whether batch scoring or real-time serving, a robust deployment flow handles rollouts, scaling, and observability. Deployment involves not only the model but also feature preprocessing, post-processing, and integration with upstream and downstream systems.

Modern stacks rely on containerization to ensure parity across environments. Service frameworks define endpoints, while orchestration platforms manage replicas, scaling policies, and updates without downtime. Teams also integrate CI/CD pipelines, security controls, and monitoring to keep services healthy. As models evolve, deployment practices must support rapid iteration without sacrificing reliability, traceability, or compliance.

Deployment Modalities

  • Batch Serving: Run models on schedules to score large data sets, storing results for downstream use (see the batch-scoring sketch after this list).
  • Online Serving: Provide synchronous APIs for real-time predictions, optimizing for latency and SLA compliance.
  • Streaming / Event-Driven: Consume message queues, process events, and emit responses with minimal delay.
  • Edge Deployment: Ship models to on-device runtimes for disconnected or low-latency scenarios.
  • Hybrid Approaches: Mix modalities (e.g., online for critical requests, batch for analytics) to balance cost and performance.
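
The batch modality is usually the simplest to operate: a scheduled job loads the model, scores a full extract, and writes results for downstream systems. A minimal sketch in Python follows; the joblib model path, parquet file names, and customer_id / churn_score columns are illustrative assumptions, not part of the stack described in this guide.

# batch_score.py -- scheduled batch-scoring sketch (illustrative assumptions throughout)
import joblib
import pandas as pd

def run_batch_job(model_path: str, input_path: str, output_path: str) -> None:
    # Load the serialized classifier produced by the training pipeline
    model = joblib.load(model_path)

    # Score a full extract in one pass; a scheduler (Airflow, cron) calls this on a cadence
    frame = pd.read_parquet(input_path)
    features = frame.drop(columns=["customer_id"])
    frame["churn_score"] = model.predict_proba(features)[:, 1]

    # Persist results for downstream consumers (dashboards, CRM exports, feature stores)
    frame[["customer_id", "churn_score"]].to_parquet(output_path, index=False)

if __name__ == "__main__":
    run_batch_job(
        "artifacts/churn_model.joblib",
        "data/customers.parquet",
        "data/churn_scores.parquet",
    )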

Team Considerations

  • Data Scientists: Collaborate on inference code, monitor performance, and propose model updates.
  • ML Engineers: Own packaging, deployment automation, and runtime optimization.
  • Platform Engineers: Manage infrastructure, observability, and security baselines.
  • Product & Compliance Stakeholders: Review model behavior, guardrails, and customer impact before release.

🔍 Why It Matters

  • Business impact: Deployed models directly influence customer experiences.
      • Personalized recommendations, fraud detection, and automation rely on timely inference.
  • Stability: Reproducible packaging avoids environment drift.
      • Containerized builds ensure the same dependencies exist across stages.
      • Standardized base images simplify patching and security updates.
  • Scalability: Autoscaling ensures consistent latency under variable loads.
      • Queueing and rate-limiting keep services responsive during traffic bursts.
      • Multi-region deployments provide resilience against zonal failures.
  • Maintainability: Structured rollouts simplify updates and rollback.
      • GitOps workflows record every change and allow quick reversions.
      • Feature flags enable progressive delivery and A/B testing.
  • Compliance: Traceable deployment history supports audits, fairness checks, and incident response.

🧰 Tools & Frameworks

| Tool | Purpose |
| --- | --- |
| FastAPI | Lightweight REST APIs for serving predictions. |
| TorchServe | Production-ready PyTorch model hosting. |
| Docker | Containerize models and dependencies. |
| Kubernetes | Orchestrate scalable inference services. |
| KServe | Kubernetes-native model serving abstraction. |
| BentoML | Bundle models with standardized server scaffolding. |
| Seldon Core | Deploy multi-model graphs with canary routing. |
| Triton Inference Server | High-performance serving for GPUs and multiple frameworks. |
| AWS SageMaker | Managed model deployment with autoscaling endpoints. |
| Azure ML Endpoints | Managed real-time and batch inference on Azure. |

Selection Guidance

  • Map tooling to framework compatibility, latency requirements, and compliance needs.
  • Use managed services for rapid prototyping; transition to Kubernetes for customization and cost control.
  • Evaluate ecosystem integrations: logging, monitoring, A/B testing, and feature stores.
  • Factor in the cost of GPUs, memory, and scaling policies; leverage spot instances or autoscaling to optimize.
  • Ensure security features exist: TLS, IAM, VPC integration, and secrets management.

🧱 Architecture / Workflow Diagram

flowchart LR
A[Model Registry] --> B[CI Packaging]
B --> C[Docker Image]
C --> D[Kubernetes Deployment]
D --> E[Service Endpoint]
E --> F[Client Applications]
D --> G[Monitoring]
G --> H[Autoscaling & Alerts]
H --> B

Diagram Walkthrough

  • Pull versioned models and metadata from the registry with approvals (a minimal registry-pull sketch follows below).
  • Package inference code, dependencies, and configuration into immutable containers.
  • Deploy to Kubernetes (or a serverless platform), applying policies for resources and security.
  • Serve predictions via HTTP/gRPC endpoints, message queues, or streaming connectors.
  • Monitor service metrics (latency, error rates) and model metrics (accuracy, drift).
  • Autoscaling policies react to load, while alerts inform on-call engineers of anomalies.
  • Monitoring insights feed back into CI packaging to trigger retraining or rollback.
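
To make the first step concrete, the sketch below loads a registered model with MLflow (the registry used in the commands further down); the version number and sample feature columns are assumptions for illustration.

# registry_pull.py -- load a versioned model from the registry before packaging (sketch)
import mlflow.pyfunc
import pandas as pd

# "models:/<name>/<version>" pins a specific registered version, keeping the build reproducible
model = mlflow.pyfunc.load_model("models:/churn-service/1")

# Quick sanity check on the artifact before it is baked into a container image
sample = pd.DataFrame([{"tenure_months": 12, "monthly_spend": 79.0}])
print(model.predict(sample))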

⚙️ Example Commands / Steps

# Build and run a FastAPI model server
docker build -t registry.io/churn-service:1.0 .
docker push registry.io/churn-service:1.0
kubectl apply -f k8s/deploy.yaml
kubectl rollout status deployment/churn-service

# Register the model in MLflow for traceability (via the Python API; registration is not an `mlflow models` CLI subcommand)
python -c "import mlflow; mlflow.register_model('runs:/prod-ready/model', 'churn-service')"

# Update KServe InferenceService definition
kubectl apply -f k8s/kserve/churn.yaml

# Perform smoke test against staging endpoint
curl -X POST https://staging.api/predict -H "Content-Type: application/json" -d '{"customer_id": "123"}'
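
The Dockerfile behind the build command above is expected to package a small FastAPI app. A minimal sketch follows; the model path and feature schema are assumptions, while the /health route matches the probes in the Kubernetes manifest below.

# app.py -- minimal FastAPI prediction service (sketch; schema and paths are assumptions)
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("artifacts/churn_model.joblib")  # loaded once at container startup

class PredictionRequest(BaseModel):
    customer_id: str
    tenure_months: int
    monthly_spend: float

@app.get("/health")
def health() -> dict:
    # Target of the readiness/liveness probes in k8s/deploy.yaml
    return {"status": "ok"}

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    features = pd.DataFrame(
        [{"tenure_months": request.tenure_months, "monthly_spend": request.monthly_spend}]
    )
    score = float(model.predict_proba(features)[:, 1][0])
    return {"customer_id": request.customer_id, "churn_score": score}

# Serve on the port expected by the Deployment below:
#   uvicorn app:app --host 0.0.0.0 --port 8080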

Configuration Snippet

# k8s/deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-service
  labels:
    app: churn-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-service
  template:
    metadata:
      labels:
        app: churn-service
    spec:
      containers:
        - name: predictor
          image: registry.io/churn-service:1.0
          ports:
            - containerPort: 8080
          envFrom:
            - secretRef:
                name: churn-service-secrets
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

Autoscaling Policy

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

📊 Example Scenario

A fintech startup packages its credit risk model into a FastAPI app. Docker images ship through GitHub Actions, Kubernetes handles blue/green rollouts, and TorchServe manages GPU serving for the resource-intensive transformer models. The team uses KServe for traffic splitting between model versions and logs predictions to BigQuery for auditing. After deployment, observability dashboards reveal slight latency spikes at market open, prompting autoscaling adjustments.

Extended Scenario Narrative

  • Sprint 1: Convert training notebooks into a FastAPI app with structured logging and validation middleware.
  • Sprint 2: Build container images using multi-stage Dockerfiles, integrate vulnerability scanning, and push artifacts to a private registry.
  • Sprint 3: Configure GitHub Actions to run integration tests, load tests, and contract checks before deploying to staging (a smoke-test sketch follows this list).
  • Sprint 4: Implement blue/green deployment via KServe; shadow traffic validates metrics before customer exposure.
  • Sprint 5: Monitor production metrics in Grafana, set SLOs for latency and accuracy, and schedule retraining when degradation occurs.
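
A smoke test of the kind Sprint 3 wires into CI might look like the sketch below, run with pytest against the staging endpoint from the earlier curl example. The payload fields and the 300 ms budget are assumptions.

# tests/test_smoke.py -- staging smoke test sketch run from CI before promotion
import time

import requests

STAGING_URL = "https://staging.api/predict"
PAYLOAD = {"customer_id": "123", "tenure_months": 12, "monthly_spend": 79.0}

def test_prediction_contract_and_latency():
    start = time.perf_counter()
    response = requests.post(STAGING_URL, json=PAYLOAD, timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Contract check: well-formed response with a bounded probability score
    assert response.status_code == 200
    assert 0.0 <= response.json()["churn_score"] <= 1.0

    # Crude latency guard; sustained load belongs in the Locust/k6 suite
    assert elapsed_ms < 300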

Additional Use Cases

  • Healthcare Diagnostics: Deploy explainable AI services with HIPAA-compliant logging and audit trails.
  • Industrial IoT: Run models on edge gateways; sync weight updates to the cloud periodically.
  • Retail Forecasting: Provide API endpoints that integrate with ERP systems for inventory planning.
  • Cybersecurity: Stream anomaly detection models with Kafka consumers and Flink for enrichment.
  • Conversational AI: Deploy multi-channel bots using serverless endpoints and session-aware caching.

💡 Best Practices

  • ✅ Externalize configs: Use environment variables or Consul for secrets.
      • Parameterize endpoints, feature stores, and threshold values.
      • Rotate credentials automatically using vault integrations.
  • ✅ Automate health checks: Configure readiness and liveness probes.
      • Include application-level checks for dependencies (feature store, database connectivity).
  • ✅ Log predictions: Persist inputs and outputs for debugging and auditing (see the redaction sketch after this list).
      • Redact sensitive fields and comply with privacy regulations.
  • ✅ Test inference: Validate latency and correctness before promoting.
      • Run load tests with Locust/k6 and scenario-specific tests for critical segments.
  • ✅ Monitor costs: Track compute usage per endpoint and optimize scaling rules.
  • ✅ Document runbooks: Provide on-call teams with rollback steps, dashboards, and contact lists.
  • ✅ Implement circuit breakers: Fail fast when downstream services misbehave, protecting uptime.
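
As referenced in the logging practice above, a minimal redaction-aware logger might look like this sketch. The sensitive field names and JSON-lines format are assumptions; many teams ship these records to a warehouse or stream instead of application logs.

# logging_utils.py -- prediction logging with redaction (sketch; field names are assumptions)
import json
import logging

logger = logging.getLogger("predictions")
SENSITIVE_FIELDS = {"email", "phone", "ssn"}

def log_prediction(features: dict, score: float) -> None:
    # Mask sensitive attributes before the record leaves the service boundary
    redacted = {
        key: "[REDACTED]" if key in SENSITIVE_FIELDS else value
        for key, value in features.items()
    }
    logger.info(json.dumps({"features": redacted, "churn_score": score}))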

⚠️ Common Pitfalls

  • 🚫 Hardcoded paths: Inline file paths break in containerized environments.
      • Use relative paths or volume mounts, and document the expected directory structure.
  • 🚫 Ignoring dependency pinning: Version mismatches cause runtime errors.
      • Lock dependencies via poetry.lock, requirements.txt, or image digests.
  • 🚫 All-in-one containers: Monolithic images impede scaling and updates.
      • Separate preprocessing, inference, and post-processing into composable services.
  • 🚫 Manual deployments: Skipping automation increases the risk of missed steps and broken rollbacks.
      • Adopt GitOps so infra changes and releases are declarative and reviewed.
  • 🚫 Missing observability: Without metrics and traces, diagnosing issues is painful.
      • Embed OpenTelemetry instrumentation and standardize log formats.
  • 🚫 No fallbacks: Lack of fallback models or heuristics causes total outages if inference fails.
      • Provide cached predictions or rule-based systems for continuity (see the fallback sketch after this list).
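
The fallback pattern called out in the last pitfall can be as simple as wrapping inference in a guard, as in this sketch. The heuristic rule, threshold values, and feature name are assumptions.

# fallback.py -- rule-based fallback when live inference fails (sketch)
def predict_with_fallback(model, features: dict) -> dict:
    try:
        # Happy path: use the deployed model
        # (feature order must match training; a real service would build a DataFrame)
        score = float(model.predict_proba([list(features.values())])[:, 1][0])
        return {"churn_score": score, "source": "model"}
    except Exception:
        # Degrade gracefully instead of failing the request; cached predictions
        # from the last batch run would be another continuity option
        heuristic = 0.8 if features.get("months_since_last_order", 0) > 6 else 0.2
        return {"churn_score": heuristic, "source": "heuristic_fallback"}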

🧩 Related Topics

  • Previous Topic: 05_Workflow_Orchestration.md
      • Orchestrated pipelines deliver artifacts ready for deployment.
  • Next Topic: 07_Monitoring_and_Observability.md
      • Monitoring ensures deployed models remain healthy and accurate.

🧭 Quick Recap

| Step | Purpose |
| --- | --- |
| Package model | Bundle artifact, code, and dependencies reproducibly. |
| Deploy service | Expose prediction endpoints via scalable infrastructure. |
| Monitor runtime | Track latency, errors, and resource usage. |
| Automate scaling | Maintain performance during traffic surges. |
| Govern releases | Document approvals, rollbacks, and audit trails. |

How to Apply This Recap

  • Use it in deployment readiness reviews to confirm checklist completion.
  • Share it with operations teams to align on responsibilities for incident response.
  • Convert the steps into onboarding documentation for new ML engineers.

🖼️ Assets

  • Diagram: Architecture / workflow diagram above.
  • Storyboard Idea: Show progression from registry to live endpoint with feedback loop.
  • Dashboard Tip: Display latency percentiles, error counts, and autoscaling events.

📘 References

  • Official docs https://kserve.github.io/website/latest/modelserving/overview/
  • Official docs https://docs.seldon.io/projects/seldon-core/en/latest/
  • Official docs https://docs.bentoml.org/
  • Blog posts https://towardsdatascience.com/scalable-ml-model-deployments
  • Blog posts https://netflixtechblog.com/ml-in-production
  • Blog posts https://aws.amazon.com/blogs/machine-learning/category/machine-learning/mlops
  • GitHub examples https://github.com/pytorch/serve/tree/master/examples
  • GitHub examples https://github.com/bentoml/BentoML
  • GitHub examples https://github.com/kserve/kserve/tree/master/docs/samples