07 Monitoring And Observability
🔧 Monitoring and Observability
🎯 What You'll Learn
- Signal selection: Track latency, accuracy, and drift indicators.
  - Prioritize metrics that capture both system health and model performance.
  - Balance leading indicators (feature drift) with lagging ones (accuracy, ROI).
- Tooling mix: Combine Prometheus, Grafana, Evidently, and logging stacks.
  - Integrate metrics, logs, and traces into a unified observability platform.
  - Leverage model-specific tools alongside traditional APM solutions.
- Automated responses: Trigger retraining or rollback based on alerts.
  - Connect alert pipelines to orchestration tools for immediate remediation.
  - Document runbooks so humans understand automated actions.
- Feedback loops: Align monitoring data with business KPIs.
  - Translate technical signals into business impact dashboards for stakeholders.
  - Create feedback tickets that drive prioritization in product roadmaps.
🌐 Overview
Monitoring keeps production models trustworthy long after deployment. Observability collects metrics, logs, and traces across data pipelines, inference services, and downstream outcomes. With the right telemetry, teams spot performance regressions, data drift, or infrastructure bottlenecks before customers notice. Effective setups provide granular views of the full ML lifecycle, from ingestion through prediction and business outcome measurement.
Successful configurations correlate ML-specific signals (prediction confidence, feature statistics) with platform metrics (CPU, latency) so engineers can diagnose issues quickly and act decisively. Instead of ad-hoc dashboards, observability becomes a product: alerts, runbooks, and dashboards are versioned, reviewed, and continuously improved. Teams adopt SRE practices (SLOs, error budgets) tailored for ML.
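To make that correlation concrete, here is a minimal sketch of a prediction service exposing model-specific metrics alongside request-level metrics using the `prometheus_client` library. The metric names, labels, and the placeholder scoring call are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical sketch: exposing ML-specific metrics with prometheus_client.
# Metric names, labels, and the placeholder scoring logic are illustrative only.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTIONS = Counter(
    "prediction_requests_total",
    "Number of prediction requests served",
    ["model_version", "outcome"],
)
LATENCY = Histogram(
    "prediction_latency_ms",
    "End-to-end prediction latency in milliseconds",
    ["model_version"],
    buckets=(5, 10, 25, 50, 100, 200, 400, 800),
)
CONFIDENCE = Gauge(
    "prediction_confidence_last",
    "Confidence score of the most recent prediction",
    ["model_version"],
)

MODEL_VERSION = "2024-05-01"  # would normally come from the model registry


def predict(features):
    """Score a feature vector and record telemetry around the call."""
    start = time.perf_counter()
    try:
        score = 0.87  # placeholder for model.predict_proba(features)
        CONFIDENCE.labels(MODEL_VERSION).set(score)
        PREDICTIONS.labels(MODEL_VERSION, "success").inc()
        return score
    except Exception:
        PREDICTIONS.labels(MODEL_VERSION, "error").inc()
        raise
    finally:
        LATENCY.labels(MODEL_VERSION).observe((time.perf_counter() - start) * 1000)


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        predict({"credit_score": 700})
        time.sleep(1)
```

Prometheus can then scrape the `/metrics` endpoint alongside node and container exporters, so dashboards can join model signals with platform signals on shared labels.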
Monitoring Layers
- Data Pipeline Monitoring: Ensure upstream ingestion, transformation, and feature stores remain healthy and timely.
- Model Serving Monitoring: Track latency, throughput, error rates, and resource usage for inference endpoints.
- Model Performance Monitoring: Compare predictions against actual outcomes to measure accuracy, drift, and fairness (see the drift-check sketch after this list).
- Business KPI Monitoring: Connect model outputs to metrics like conversions, revenue, or customer satisfaction.
- User Experience Monitoring: Observe how users interact with model-driven features and gather qualitative feedback.
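For the model performance layer, here is a small drift-check sketch using Evidently's report API (0.4.x-style; class names have shifted between releases). The file paths, column layout, and the result-dict key path are assumptions.

```python
# Hypothetical sketch: comparing live features against the training baseline
# with Evidently (0.4.x-style API; newer releases rename some classes).
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Assumed inputs: parquet snapshots of training and recent production features.
reference = pd.read_parquet("data/training_features.parquet")
current = pd.read_parquet("data/production_features_last_24h.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Persist an HTML report for humans and a dict for automated thresholding.
report.save_html("reports/feature_drift.html")
summary = report.as_dict()
# The exact key layout can differ across Evidently versions.
drift_detected = summary["metrics"][0]["result"]["dataset_drift"]
print("dataset drift detected:", drift_detected)
```

Run on a schedule, the boolean can be exported as a metric, attached to a ticket, or used to gate automated retraining.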
Stakeholder Impact
- ML Engineers: Tune models and pipelines based on actionable alerts.
- Data Scientists: Validate hypotheses, identify drift, and propose retraining.
- Product Managers: Understand business implications and prioritize roadmap adjustments.
- SRE/Platform Teams: Maintain infrastructure SLOs, incident response, and capacity planning.
- Compliance Officers: Access auditable logs for regulatory reporting.
📌 Why It Matters
- Risk mitigation: Early drift detection prevents biased or stale predictions.
  - Alerts trigger retraining before business impacts escalate.
- Operational uptime: Alerting reduces mean time to detect and resolve incidents.
  - On-call teams receive contextual notifications with recommended actions.
- Regulatory compliance: Audit trails show models behave within approved ranges.
  - Structured logs capture inputs, outputs, and decisions for each prediction.
- Continuous improvement: Feedback data feeds future model iterations.
  - Monitoring insights inform experiment backlogs and feature engineering.
- Customer trust: Transparent status pages and post-incident reports build confidence.
🧰 Tools & Frameworks
| Tool | Purpose |
|---|---|
| Prometheus | Time-series metrics collection and alerting. |
| Grafana | Visualization dashboards and anomaly detection panels. |
| Evidently | Statistical drift and quality reports for ML models. |
| OpenTelemetry | Standardized tracing and metrics instrumentation. |
| Seldon Alibi Detect | Real-time drift detection for deployed models. |
| Great Expectations | Data quality validation integrated into monitoring loops. |
| Loki / ELK | Centralized log aggregation and search. |
| Jaeger | Distributed tracing for inference paths. |
| PagerDuty / Opsgenie | Incident escalation and on-call coordination. |
| DataDog / New Relic | Managed APM with ML monitoring extensions. |
Integration Tips
- Standardize metric names and labels for easier querying and correlation.
- Use service meshes or sidecars (Envoy, Istio) to collect telemetry without modifying model code.
- Enable exemplars to link traces with metrics for faster root-cause analysis (see the tracing sketch below).
- Version dashboards and alerts in Git; review changes via pull requests.
- Apply RBAC to monitoring tools to protect sensitive model metadata.
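To complement these tips, here is a minimal tracing sketch with the OpenTelemetry Python SDK. The service name, span names, and attributes are assumptions, and a production setup would replace the console exporter with an OTLP exporter pointed at a collector.

```python
# Hypothetical sketch: tracing an inference path with OpenTelemetry.
# Span names/attributes are illustrative; swap ConsoleSpanExporter for an
# OTLP exporter in a real deployment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "prediction-service"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("prediction-service")


def predict(features: dict) -> float:
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("model.version", "2024-05-01")
        with tracer.start_as_current_span("preprocess"):
            prepared = {k: float(v) for k, v in features.items()}
        with tracer.start_as_current_span("model.score"):
            score = 0.87  # placeholder for the actual model call on `prepared`
        span.set_attribute("prediction.score", score)
        return score


if __name__ == "__main__":
    predict({"credit_score": 700})
    provider.shutdown()  # flush spans before exit
```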
🧱 Architecture / Workflow Diagram
```mermaid
flowchart LR
    A[Prediction Service] --> B[Metrics Exporter]
    A --> C[Structured Logging]
    B --> D[Prometheus]
    C --> E[Log Store]
    D --> F[Grafana Dashboards]
    E --> F
    D --> G[Alert Manager]
    G --> H[On-call & Automation]
    H --> I[Retrain / Rollback Pipeline]
    I --> J[Feature Store & Training]
    J --> A
```
Diagram Walkthrough
- Prediction Service emits metrics (latency, success rates) and logs (inputs, outputs).
- Metrics Exporter collects model-specific metrics via custom exporters or OpenTelemetry SDKs.
- Prometheus scrapes metrics and evaluates alert rules.
- Log Store (Loki, Elasticsearch) aggregates structured logs for auditing and debugging.
- Grafana Dashboards correlate metrics and logs, providing red/amber/green status.
- Alert Manager routes alerts based on severity, channel, and on-call schedules.
- On-call & Automation triggers incident workflows, sends notifications, or executes runbooks.
- Retrain / Rollback Pipeline launches automated retraining or reverts models to previous versions.
- Feature Store & Training updates features or models based on monitoring feedback, closing the loop.
⚙️ Example Commands / Steps
```yaml
# prometheus alert rules (wrapped in a rule group so the file loads as-is;
# the group name is illustrative)
groups:
  - name: ml-service
    rules:
      - alert: ModelLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(prediction_latency_bucket[5m])) by (le, model)) > 200
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "95th percentile latency above 200ms"
      - alert: FeatureDriftDetected
        expr: avg_over_time(model_feature_psi{feature="credit_score"}[1h]) > 0.2
        labels:
          severity: warn
        annotations:
          summary: "Population stability index indicates drift on credit_score"
```
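Alerts like these can feed automation by configuring an Alertmanager webhook receiver that points at a small remediation service. The sketch below uses Flask; the alert-to-action mapping and the `trigger_retraining`/`rollback_model` helpers are placeholders for whatever orchestration hooks the team actually exposes.

```python
# Hypothetical sketch: an Alertmanager webhook receiver that maps alerts to
# automated remediation. trigger_retraining/rollback_model are stand-ins for
# calls into your orchestration tool (Prefect, Airflow, a CI pipeline, ...).
from flask import Flask, jsonify, request

app = Flask(__name__)


def trigger_retraining(model: str) -> None:
    print(f"launching retraining pipeline for {model}")  # placeholder


def rollback_model(model: str) -> None:
    print(f"rolling back {model} to the previous version")  # placeholder


@app.route("/alerts", methods=["POST"])
def handle_alerts():
    payload = request.get_json(force=True)
    # Alertmanager posts a JSON body with an "alerts" list; each alert carries
    # its status, labels, and annotations.
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        name = labels.get("alertname")
        model = labels.get("model", "unknown")
        if name == "FeatureDriftDetected":
            trigger_retraining(model)
        elif name == "ModelLatencyHigh":
            rollback_model(model)
    return jsonify({"status": "ok"})


if __name__ == "__main__":
    app.run(port=9095)
```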
Logging Snippet
```python
import structlog

logger = structlog.get_logger()


def predict(request):
    # preprocess, model, current_model_version, and hash_features are assumed
    # to be defined elsewhere in the serving code.
    features = preprocess(request)
    prediction = model.predict_proba(features)
    logger.info(
        "prediction",
        customer_id=request["customer_id"],
        score=float(prediction[0][1]),
        model_version=current_model_version,
        feature_hash=hash_features(features),
    )
    return prediction
```
Dashboard as Code Example
```jsonnet
// grafana_dashboard.jsonnet excerpt
// Helper names below follow a grafonnet-style library; the exact constructors
// differ between grafonnet versions.
local grafana = import "grafonnet/grafana.libsonnet";

grafana.dashboard.new(
  "ML Service Health",
  panels=[
    grafana.grafanaGraph.new(
      title="Latency p95",
      datasource="Prometheus",
      targets=[
        grafana.prometheusTarget.new(expr='histogram_quantile(0.95, sum(rate(prediction_latency_bucket[5m])) by (le))'),
      ],
    ),
    grafana.grafanaStat.new(
      title="Feature Drift PSI",
      datasource="Prometheus",
      targets=[grafana.prometheusTarget.new(expr='model_feature_psi{feature="credit_score"}')],
    ),
  ],
)
```
🌍 Example Scenario
An insurance recommendation API exports metrics with OpenTelemetry. Prometheus scrapes inference latency and accuracy, Evidently compares live feature distributions to training baselines, and Grafana alerts trigger a Prefect flow to retrain when drift persists. On-call engineers receive PagerDuty incidents with context-rich dashboards and can trigger rollback to a prior model directly from chatops.
Extended Scenario Narrative
- Day 0: Observability squad defines SLOs: latency p95 < 200 ms, drift PSI < 0.1.
- Day 7: Feature drift alert fires for credit_score; engineers review the Grafana panel to confirm the trend.
- Day 8: An automated Prefect flow pulls fresh labeled data, retrains the model, and updates the registry (sketched after this list).
- Day 9: Canary deployment shows restored accuracy; monitoring dashboards confirm the drift is resolved.
- Day 10: Post-incident review documents lessons learned and updates alert thresholds.
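The Day 8 step could look roughly like the following Prefect flow (Prefect 2.x-style decorators); the task bodies, storage paths, and registry identifiers are placeholders for the team's real data access, training, and registry calls.

```python
# Hypothetical sketch of the Day 8 retraining flow using Prefect 2.x.
# Task bodies are placeholders; swap in real data access, training,
# and model-registry calls.
from prefect import flow, task


@task
def pull_labeled_data(window_days: int = 30) -> str:
    # e.g. export fresh labels from the warehouse to object storage
    return f"s3://bucket/labels/last_{window_days}_days.parquet"


@task
def train_model(dataset_uri: str) -> str:
    # e.g. fit the model and return a local artifact path
    print(f"training on {dataset_uri}")
    return "artifacts/model.pkl"


@task
def register_model(artifact_path: str) -> str:
    # e.g. push to the model registry and return the new version id
    print(f"registering {artifact_path}")
    return "credit-model:v42"


@flow(name="drift-retrain")
def drift_retrain_flow() -> str:
    dataset = pull_labeled_data()
    artifact = train_model(dataset)
    return register_model(artifact)


if __name__ == "__main__":
    print("registered", drift_retrain_flow())
```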
Additional Use Cases
- Retail Demand Forecasting: Track prediction errors vs actual sales to recalibrate models.
- Fraud Detection: Monitor false positive rates and customer friction metrics to balance security and UX.
- Healthcare Diagnostics: Ensure sensitivity/specificity stay within approved limits; alert clinicians when confidence drops.
- Recommendation Systems: Observe engagement metrics (CTR, conversion) alongside offline accuracy.
- NLP Assistants: Monitor toxicity scores, intent accuracy, and latency for multi-turn dialogues.
💡 Best Practices
- ✅ Monitor end-to-end: Include data freshness, queue depth, and business KPIs.
  - Schedule data freshness checks and tie them to incident severity levels.
- ✅ Automate dashboards: Version Grafana-as-code alongside services.
  - Apply Git workflows so changes are reviewed and tested before release.
- ✅ Set SLOs: Define acceptable error and latency windows per model.
  - Establish error budgets that guide experimentation vs stability decisions (worked example after this list).
- ✅ Test alerts: Simulate incidents to validate on-call readiness.
  - Run game days covering drift, latency spikes, and infrastructure failures.
- ✅ Provide context: Include runbooks, charts, and suggested next steps in alert notifications.
- ✅ Secure telemetry: Anonymize and encrypt logs containing sensitive data.
- ✅ Align with product: Share monitoring insights in sprint reviews to drive roadmap updates.
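As a worked example of the SLO and error-budget practice, here is a small, self-contained sketch of computing error-budget burn for a latency objective; the request counts would normally come from Prometheus queries, and the numbers are purely illustrative.

```python
# Hypothetical sketch: computing error-budget burn for a latency SLO.
# Numbers are illustrative; in practice the counts come from Prometheus queries.
def error_budget_report(slo_target: float, total_requests: int, bad_requests: int) -> dict:
    """slo_target of 0.99 means 99% of requests must meet the latency objective."""
    allowed_bad = (1.0 - slo_target) * total_requests  # total budget in requests
    budget_used = bad_requests / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_requests": round(allowed_bad),
        "observed_bad_requests": bad_requests,
        "budget_consumed_pct": round(budget_used * 100, 1),
        "budget_remaining_pct": round(max(0.0, 1.0 - budget_used) * 100, 1),
    }


# Example: 99% of requests under 200 ms, 1.2M requests this month, 3,000 too slow.
print(error_budget_report(slo_target=0.99, total_requests=1_200_000, bad_requests=3_000))
# -> 12,000 slow requests allowed; 25% of the budget consumed so far.
```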
⚠️ Common Pitfalls
- 🚫 Metric overload: Too many signals obscure critical anomalies.
  - Periodically audit dashboards; remove unused panels and prioritize impactful metrics.
- 🚫 No ground truth: Without labeled feedback, accuracy drift goes unnoticed.
  - Implement feedback loops (human review, delayed labels) to collect outcomes.
- 🚫 Static thresholds: Hard-coded alert levels miss gradual degradation.
  - Use adaptive thresholds, anomaly detection, or statistical process control (see the sketch after this list).
- 🚫 Fragmented tooling: Disconnected systems make root-cause analysis slow.
  - Consolidate observability stacks or establish integration pipelines.
- 🚫 Manual incident response: Lack of automation prolongs outages.
  - Create scripts or workflows for common fixes (cache bust, fallback activation).
- 🚫 Ignoring costs: Excessive logging or high-cardinality metrics inflate bills.
  - Apply sampling, rate limits, and retention policies.
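One lightweight alternative to static thresholds is a rolling control-limit check in the spirit of statistical process control. The sketch below is illustrative only; the window size, warm-up length, and sigma multiplier are assumptions to tune per metric.

```python
# Hypothetical sketch: flagging anomalies with rolling control limits instead
# of a fixed threshold. Window size and sigma multiplier are illustrative.
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    def __init__(self, window: int = 60, sigmas: float = 3.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Return True when the new value breaches the rolling control limit."""
        breach = False
        if len(self.history) >= 10:  # wait for a minimal baseline
            mu, sd = mean(self.history), stdev(self.history)
            breach = value > mu + self.sigmas * sd
        self.history.append(value)
        return breach


if __name__ == "__main__":
    detector = AdaptiveThreshold(window=60, sigmas=3.0)
    latencies = [120, 118, 125, 122, 119, 121, 117, 124, 120, 123, 320]
    for ms in latencies:
        if detector.observe(ms):
            print(f"latency anomaly: {ms} ms")
```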
🧩 Related Topics
- Previous Topic: 06_Model_Deployment.md - Discusses how deployment pipelines expose metrics for observability hooks.
- Next Topic: 08_Cloud_and_Infrastructure.md - Covers infrastructure choices that support monitoring scalability and cost control.
🧠 Quick Recap
| Step | Purpose |
|---|---|
| Collect telemetry | Gather metrics, logs, and traces. |
| Detect anomalies | Surface drift, latency, and error spikes. |
| Automate response | Trigger retraining, rollback, or incident workflows. |
| Share insights | Communicate impact to product and compliance stakeholders. |
| Iterate tooling | Improve dashboards, alerts, and runbooks continuously. |
How to Apply This Recap
- Use during incident response retros to verify coverage of monitoring steps.
- Incorporate into onboarding materials for new ML/SRE team members.
- Align with SLO review meetings to adjust thresholds and priorities.
🖼️ Assets
- Diagram: 
- Storyboard Idea: Show detection of drift leading to automated retraining and improved metrics.
- Dashboard Tip: Provide interactive drill-downs that link metrics to specific model versions.
📚 References
- Official docs: https://prometheus.io/docs/introduction/overview/
- Official docs: https://grafana.com/docs/
- Official docs: https://docs.evidentlyai.com/
- Blog post: https://evidentlyai.com/blog/monitoring-ml-models
- Blog post: https://mlops.community/model-monitoring
- Blog post: https://cloud.google.com/blog/topics/ai-machine-learning/ml-observability
- GitHub examples: https://github.com/SeldonIO/alibi-detect
- GitHub examples: https://github.com/evidentlyai/evidently
- GitHub examples: https://github.com/open-telemetry/opentelemetry-python