08 Cloud And Infrastructure

🧠 Cloud and Infrastructure

🎯 What You’ll Learn

  • Platform comparison: Align AWS, GCP, and Azure ML offerings with project needs.
  • Match services to workloads (batch training, streaming inference, AutoML).
  • Understand pricing models and regional availability when architecting solutions.
  • Infrastructure as Code: Use Terraform and Helm to standardize environments.
  • Treat infrastructure definitions as versioned artifacts with code reviews.
  • Automate deployments across dev, staging, and production with consistent templates.
  • Resource strategy: Optimize compute, storage, and networking for ML workloads.
  • Right-size GPU/CPU instances, optimize data transfer, and manage ephemeral storage.
  • Implement data lifecycle policies to control costs while meeting retention requirements.
  • Security posture: Apply identity, encryption, and compliance controls.
  • Configure IAM roles, network segmentation, and key management services.
  • Embed compliance checks (SOC 2, HIPAA, GDPR) into infrastructure pipelines.

📖 Overview

Cloud platforms provide elastic resources to run ML pipelines end-to-end. Choosing the right mix of managed services and self-managed clusters affects cost, agility, and governance. Infrastructure-as-Code (IaC) patterns declare environments reproducibly, so dev, staging, and prod stay synchronized. Standardized environments reduce onboarding time and prevent drift that causes deployment failures.

Containers and orchestration abstract underlying hardware, enabling hybrid or multi-cloud deployments. By codifying infrastructure, teams spin up MLOps foundations quickly while meeting enterprise security requirements. Multi-account or multi-project strategies isolate workloads, while policy-as-code enforces guardrails for cost, security, and compliance.
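As a minimal sketch of the "declare environments reproducibly" pattern, a single variable can parameterize otherwise identical environments; the module path, names, and sizing below are illustrative assumptions, not a prescribed layout:

```hcl
# Illustrative only: module path, names, and sizing are assumptions.
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, or prod."
  }
}

module "ml_platform" {
  source      = "./modules/ml_platform"
  environment = var.environment

  # Same template everywhere; only sizing differs, which keeps
  # dev, staging, and prod from drifting apart structurally.
  node_count = var.environment == "prod" ? 6 : 2
}
```

Because all three environments instantiate the same module, a fix reviewed once propagates everywhere, which is what prevents the drift described above.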

Infrastructure Layers

  • Foundation: Identity, networking, VPCs, subnets, and security groups.
  • Compute: Kubernetes clusters, serverless runtimes, VM pools, and GPU farms.
  • Data Plane: Object storage, data lakes, warehouses, and feature stores.
  • Platform Services: Managed ML offerings, orchestration tools, registries, and monitoring.
  • Delivery Layer: Deployment pipelines, load balancers, API gateways, and CDN endpoints.
  • Governance Layer: Cost management, policy enforcement, auditing, and logging.

Deployment Topologies

  • Single Cloud: Leverage native services for simplicity and tight integration.
  • Multi-Region: Deploy across regions for redundancy and low-latency experiences.
  • Multi-Cloud: Combine services from different providers to avoid lock-in or meet jurisdictional constraints.
  • Hybrid: Connect on-premises data centers with cloud for data residency or legacy system integration.

🔍 Why It Matters

  • Scalability: Elastic compute handles spiky workloads like hyperparameter sweeps.
  • Spot instances or preemptible VMs reduce cost during large-scale experimentation.
  • Cost control: Visibility into resource usage prevents runaway cloud spend.
  • FinOps practices allocate budgets, set alerts, and prioritize optimization efforts.
  • Standardization: IaC avoids snowflake environments that are hard to debug.
  • Versioned templates ensure that rollouts and recoveries follow known patterns.
  • Compliance: Managed services simplify encryption, logging, and access policies.
  • Policy-as-code enforces regional data residency and audit logging requirements.
  • Innovation speed: Developers self-service new environments safely, accelerating feature delivery.
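The spot/preemptible point above can be sketched as a Terraform node pool. Assuming a GKE cluster and the Google provider, the cluster name, machine type, and scaling limits here are illustrative:

```hcl
# Illustrative sketch: cluster name, machine type, and limits are assumptions.
resource "google_container_node_pool" "preemptible_gpu" {
  name    = "preemptible-gpu-pool"
  cluster = "mlops-gke"

  # Scale to zero when no experiments are running.
  autoscaling {
    min_node_count = 0
    max_node_count = 8
  }

  node_config {
    preemptible  = true # reclaimable capacity at a steep discount
    machine_type = "n1-standard-8"

    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
```

Preemptible nodes can be reclaimed at any time, so this pattern suits checkpointed training jobs and sweeps, not latency-sensitive serving.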

🧰 Tools & Frameworks

| Tool | Purpose |
| --- | --- |
| AWS SageMaker | Managed training, deployment, and monitoring. |
| Google Vertex AI | Unified data, training, and MLOps tooling. |
| Azure Machine Learning | Enterprise ML platform with governance. |
| Terraform | IaC to provision cloud infrastructure consistently. |
| Helm | Package and deploy Kubernetes resources declaratively. |
| Pulumi | Infrastructure automation with general-purpose languages. |
| Crossplane | Kubernetes-native infrastructure orchestration. |
| AWS CDK | Define cloud resources using familiar programming languages. |
| HashiCorp Vault | Centralized secrets management and dynamic credentials. |
| AWS Control Tower / Azure Landing Zones | Governed multi-account foundations. |

Tooling Tips

  • Combine Terraform for foundational resources and Helm for Kubernetes workloads.
  • Use modules and registries to share baseline configuration (networking, logging, security).
  • Integrate policy-as-code tools like OPA, Conftest, or Terraform Cloud policy sets.
  • Automate secret rotation and injection using Vault, AWS Secrets Manager, or Azure Key Vault.
  • Monitor drift with Terraform Cloud, Atlantis, or open-source drift detection tools.
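Sharing baseline configuration through a registry might look like the sketch below; the organization, module names, and version constraints are illustrative assumptions:

```hcl
# Illustrative: organization, module names, and versions are assumptions.
module "baseline_network" {
  source  = "app.terraform.io/acme/baseline-network/aws"
  version = "~> 3.2"

  environment = "staging"
}

module "baseline_logging" {
  source  = "app.terraform.io/acme/baseline-logging/aws"
  version = "~> 1.0"

  # Retention is a team policy decision; 90 days is a placeholder.
  retention_days = 90
}
```

Pinning module versions with constraints like `~> 3.2` lets teams pick up patch releases automatically while reviewing breaking upgrades deliberately.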

🧱 Architecture / Workflow Diagram

flowchart LR
A[Terraform Config] --> B[Cloud Provider APIs]
B --> C[Networking & Security]
C --> D["Compute Cluster (K8s/VMs)"]
D --> E[ML Platform Services]
E --> F[Model Deployment]
F --> G[Monitoring & Logging]
G --> H[Cost & Policy Management]
H --> I[Governance Feedback]
I --> A

Diagram Walkthrough

  • Terraform Config defines baseline infrastructure, enforced via CI/CD pipelines.
  • Cloud Provider APIs provision VPCs, IAM roles, and managed services.
  • Networking & Security configure subnets, security groups, private endpoints, and firewall rules.
  • Compute Cluster hosts orchestration (Airflow, Kubeflow) and training workloads, possibly with autoscaling.
  • ML Platform Services include registries, feature stores, and experiment tracking integrated with the compute layer.
  • Model Deployment uses Kubernetes services, serverless endpoints, or managed endpoints for inference.
  • Monitoring & Logging centralize telemetry in tools like CloudWatch, Stackdriver, or Azure Monitor.
  • Cost & Policy Management evaluate usage and compliance, feeding adjustments back into Terraform modules.

⚙️ Example Commands / Steps

# Provision infra with Terraform
terraform init
terraform apply -var="environment=staging"

# Deploy ML workload onto Kubernetes
helm upgrade --install feature-store charts/feast --namespace mlops

# Enforce policy checks before apply
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --data policies/ --input plan.json "data.terraform.deny"

# Configure workload identity on GKE (project, namespace, and account names illustrative)
gcloud container clusters get-credentials mlops-gke
gcloud iam service-accounts add-iam-policy-binding mlflow@project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:project.svc.id.goog[mlops/mlflow-server]"
kubectl annotate serviceaccount mlflow-server --namespace mlops \
  iam.gke.io/gcp-service-account=mlflow@project.iam.gserviceaccount.com

Terraform Module Snippet

module "mlops_network" {
  source = "git::https://github.com/company/terraform-modules.git//networking?ref=v1.4.0"

  vpc_cidr = "10.20.0.0/16"
  subnets = [
    {
      name               = "public"
      cidr               = "10.20.0.0/20"
      availability_zones = ["us-east-1a", "us-east-1b"]
    },
    {
      name               = "private"
      cidr               = "10.20.16.0/20"
      availability_zones = ["us-east-1a", "us-east-1b"]
    }
  ]
  enable_flow_logs = true
}

📊 Example Scenario

A logistics enterprise uses Terraform to provision networking, GKE clusters, and Cloud Storage. Vertex AI handles training and model registry, while Helm deploys custom inference services across multiple regions with centralized logging in Cloud Operations. Cost dashboards in Looker surface spending anomalies, enabling FinOps teams to tweak autoscaling rules.

Extended Scenario Narrative

  • Week 1: Platform team establishes landing zones, VPC peering, and identity federation.
  • Week 2: Terraform pipelines roll out shared services: monitoring, logging, and artifact storage.
  • Week 3: ML engineers deploy Airflow on GKE, connect to managed Postgres, and configure DVC remote storage.
  • Week 4: Business domain teams clone IaC modules to launch domain-specific ML stacks.
  • Week 5: Incident simulation validates disaster recovery: failover to the secondary region succeeds within the RTO.

Additional Use Cases

  • Financial Services: Multi-region, compliant foundations with strict IAM boundaries and auditing.
  • Healthcare: Hybrid architecture connecting on-prem EHR systems to cloud-based ML services with secure tunnels.
  • Retail: Edge inference clusters deployed via GitOps to hundreds of stores.
  • Gaming: Autoscaling GPU fleets for real-time personalization during tournaments.
  • Manufacturing: IoT gateways streaming data to cloud-based digital twins and analytics platforms.

💡 Best Practices

  • ✅ Modularize IaC: Reuse Terraform modules for repeatable environments.
  • Create shared libraries for networking, secrets, and observability.
  • ✅ Implement tagging: Track cost centers and projects across resources.
  • Enforce tagging policies via policy-as-code and automated linting.
  • ✅ Harden security: Enforce IAM least privilege and rotate secrets.
  • Enable audit logging and use identity federation to avoid long-lived keys.
  • ✅ Automate scaling: Use autoscaling policies for nodes and endpoints.
  • Combine the cluster autoscaler with HPA/KPA for workloads.
  • ✅ Apply chaos testing: Validate resiliency through controlled failure injection.
  • ✅ Document runbooks: Provide standard operating procedures for infra incidents and maintenance.
  • ✅ Monitor cost anomalies: Set up budgets and alerts; hold regular FinOps reviews.
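The tagging practice above can be enforced at the provider level rather than per resource. For example, the Terraform AWS provider's `default_tags` block stamps every taggable resource it manages; the tag values here are placeholders:

```hcl
# Illustrative sketch: region and tag values are placeholders.
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates,
  # so cost reports can be sliced without per-resource effort.
  default_tags {
    tags = {
      CostCenter  = "ml-platform"
      Environment = "staging"
      ManagedBy   = "terraform"
    }
  }
}
```

Provider-level defaults still allow per-resource `tags` to add or override entries, so domain teams can layer project-specific labels on top of the baseline.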

⚠️ Common Pitfalls

  • 🚫 Manual console changes: Drift appears between documented and actual state.
  • Restrict console access; enforce changes through pipelines.
  • 🚫 Ignoring quotas: Training jobs fail when regional limits are hit unexpectedly.
  • Track quota consumption and submit increase requests proactively.
  • 🚫 Single-region setups: Lack of redundancy leads to outages during incidents.
  • Deploy across regions and test failover paths.
  • 🚫 Overprovisioned clusters: Idle GPU nodes inflate cost without delivering value.
  • Implement scale-down policies and scheduled shutdowns.
  • 🚫 Missing observability: Insufficient logging hampers debugging; ensure infrastructure metrics are captured.
  • 🚫 Weak network controls: Open security groups expose services; adopt zero-trust networking.
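To make the last pitfall concrete, a sketch of a locked-down security group follows; the VPC ID and CIDR are placeholders, and the point is simply that ingress is restricted to an internal range instead of `0.0.0.0/0`:

```hcl
# Illustrative sketch: VPC ID and CIDR range are placeholders.
resource "aws_security_group" "inference_api" {
  name   = "inference-api"
  vpc_id = "vpc-0123456789abcdef0" # placeholder

  ingress {
    description = "Internal clients only, never the open internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.20.0.0/16"] # matches the module's VPC range above
  }
}
```

A policy-as-code check that denies any plan containing a `0.0.0.0/0` ingress rule turns this convention into an enforced guardrail.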

🧩 Related Topics

  • Previous Topic: 07_Monitoring_and_Observability.md
  • Emphasizes the telemetry approach needed to monitor cloud resources effectively.
  • Next Topic: 09_Advanced_MLOps.md
  • Builds on infrastructure foundations to deliver advanced platform capabilities.

🧭 Quick Recap

| Step | Purpose |
| --- | --- |
| Define cloud architecture | Map workloads to managed or self-managed services. |
| Codify infrastructure | Use IaC to ensure consistent environments. |
| Secure and monitor | Enforce policies, collect telemetry, and optimize costs. |
| Automate governance | Apply policy-as-code to maintain compliance. |
| Optimize cost | Track spend and adjust scaling, storage, and resource choices. |

How to Use This Recap

  • Review during platform design sessions to confirm critical layers are addressed.
  • Include in onboarding materials for engineers managing cloud ML infrastructure.
  • Align with FinOps and security stakeholders to validate governance coverage.

🖼️ Assets

  • Diagram: Architecture
  • Storyboard Idea: Depict infrastructure provisioning flow from Terraform plan to monitoring dashboards.
  • Dashboard Tip: Visualize regional usage, cost trends, and compliance findings.

📘 References

  • Official docs: https://registry.terraform.io/
  • Official docs: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines
  • Official docs: https://learn.microsoft.com/azure/architecture/reference-architectures/ai
  • Blog posts: https://cloud.google.com/blog/products/ai-machine-learning/mlops-in-the-cloud
  • Blog posts: https://aws.amazon.com/blogs/architecture/tag/mlops/
  • Blog posts: https://medium.com/microsoftazure/mlops-on-azure
  • GitHub examples: https://github.com/aws-samples/aws-mlops-framework
  • GitHub examples: https://github.com/GoogleCloudPlatform/mlops-on-gcp
  • GitHub examples: https://github.com/Azure/azure-mlops-project