🧠 Data Versioning and Management

🎯 What You’ll Learn

  • Snapshot strategy: capture reliable dataset versions for experiments.
      • Design branching models that mirror software release patterns (dev, staging, prod).
      • Document retention policies for raw vs. curated data, including legal hold requirements.
      • Link dataset snapshots to experiment IDs so analysts can reproduce results on demand.
  • Storage design: organize raw, staging, and curated layers with traceability.
      • Separate landing, processing, and gold zones to isolate data quality responsibilities.
      • Use data contracts to define schemas, SLAs, and acceptable value ranges per layer.
      • Structure folder hierarchies and table partitions for auditable, cost-effective storage.
  • Governance tactics: enforce access controls, lineage, and cataloging.
      • Integrate RBAC, column masking, and encryption policies across pipelines.
      • Maintain lineage graphs that connect datasets to upstream sources and downstream models.
      • Automate stewardship workflows for reviews, approvals, and data expiration.
  • Tool fit: match DVC, LakeFS, and Delta Lake to workload needs.
      • Evaluate CPU, network, and metadata overhead per tool based on data volume and concurrency.
      • Combine tools (e.g., DVC for smaller datasets, Delta Lake for petabyte-scale tables).
      • Plan migration paths from manual storage buckets to governed lakehouse architectures.

πŸ“– Overview

Data fuels every stage of the ML lifecycle, so uncontrolled copies quickly erode reproducibility. Data versioning extends Git-style workflows to petabyte-scale storage, letting teams branch, tag, and merge datasets while preserving full history. Consistent versioning lets teams re-run experiments, audit model decisions, and debug anomalies without guesswork.

Management adds automation: quality checks, metadata catalogs, and lifecycle policies that ensure reliable inputs for training and serving. Together they unlock confident experimentation and faster compliance reporting. Data engineers, scientists, and governance teams gain confidence that the β€œdataset behind the model” can always be retrieved, inspected, and validated. This is especially critical when auditing for fairness, privacy, or regulatory compliance.

In real-world environments, data versioning is not just about storing files; it’s about aligning people, processes, and technology. Versioning integrates with CI/CD to validate new data, orchestrates with feature stores to maintain online/offline consistency, and couples with monitoring to trigger corrective actions when drift appears. Robust management bridges the gap between raw ingestion and production-grade, trustworthy datasets.

Key Concepts

  • Immutable History: treat datasets like immutable logs, append-only with time travel for debugging (see the sketch after this list).
  • Branching Workflows: create experiment branches to test transformations without risk to production tables.
  • Metadata Richness: store context such as schema versions, owner contacts, and data quality scores alongside the data.
  • Policy Enforcement: encode governance rules in code, ensuring every data pull request runs automated checks.
  • Discoverability: provide searchable catalogs so teams reuse vetted datasets instead of rebuilding from scratch.
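
To make time travel concrete, here is a minimal sketch in Python, assuming a Spark session with the Delta Lake libraries configured; the table path and version number are illustrative, not prescribed.

# Sketch: reading an earlier Delta snapshot for debugging (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-debug").getOrCreate()

path = "s3://company-gold/customers"  # hypothetical table location

current = spark.read.format("delta").load(path)
pinned = (
    spark.read.format("delta")
    .option("versionAsOf", 14)  # pin the read to an explicit snapshot version
    .load(path)
)

# Compare row counts across versions to spot unexpected mutations.
print(current.count(), pinned.count())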

Stakeholder Involvement

  • Data Engineers: build ingestion pipelines, manage storage formats, and enforce contracts.
  • ML Engineers: consume curated datasets, contribute transformations, and validate reproducibility.
  • Data Stewards: curate metadata, manage access policies, and oversee compliance.
  • Security & Compliance: audit access logs, ensure encryption, and collaborate on incident response.
  • Product & Business Analysts: request dataset enhancements and verify that data aligns with business definitions.

πŸ” Why It Matters

  • Reproducibility: model outcomes are traceable to exact data snapshots.
      • Investigations can rerun historical experiments with the same dataset fingerprint (a hashing sketch follows this list).
      • Incident response time drops because teams identify the precise data revision causing issues.
  • Team collaboration: multiple squads explore without overwriting each other.
      • Branching keeps experimental transformations isolated until they pass validation gates.
      • Merge workflows record review approvals, comments, and automated validation results.
  • Audit readiness: access logs and lineage simplify regulatory responses.
      • Compliance teams can answer "which data trained this model?" within minutes.
      • Legal holds freeze relevant dataset versions during investigations or litigation.
  • Quality assurance: automated validation catches schema drift early.
      • Schema changes trigger alerts before hitting production scoring endpoints.
      • Statistical tests flag anomalies in feature distributions, reducing model performance degradation.
  • Cost efficiency: optimized storage tiers and retention policies prevent overspending.
      • Lifecycle rules archive or delete stale snapshots while preserving critical versions.
  • Business trust: clear data provenance drives confidence in analytics and ML-driven decisions.
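
A dataset fingerprint can be as simple as a content hash recorded alongside each experiment run; a minimal sketch (the file path echoes the DVC example later in this section, and the helper name is ours):

# Sketch: computing a SHA-256 fingerprint of a dataset file so experiment
# records can point at the exact bytes used for training.
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while block := fh.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

print(file_sha256("data/raw/customers.parquet"))  # log this next to the run ID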

🧰 Tools & Frameworks

Tool                 Purpose
DVC                  Track dataset pointers in Git and push content to remotes.
LakeFS               Git-like branching and merging on object stores.
Delta Lake           ACID tables with time travel for lakehouses.
Great Expectations   Declarative data quality validation.
Feast                Centralized feature store with historical retrieval.
Apache Hudi          Incremental data management with upserts and clustering.
DataHub              Metadata catalog for lineage and discovery.
Amundsen             Data discovery portal for analysts and engineers.
Iceberg              Open table format with schema evolution and partitioning.
OpenLineage          Standardized lineage metadata interchange.

Selection Guidance

  • Data volume: for terabytes and beyond, prefer lakehouse formats (Delta Lake, Iceberg).
  • Branching needs: use LakeFS when you need Git-like branching on S3/GCS/Azure Blob.
  • Pipeline integration: combine Great Expectations with orchestrators to enforce checks automatically.
  • Feature consistency: pair Feast with versioned offline stores to prevent training-serving skew.
  • Metadata federation: integrate DataHub or Amundsen to unify metadata across warehouses, lakes, and BI tools.
  • Governance frameworks: adopt OpenLineage or Marquez to standardize lineage collection across platforms.
  • Latency requirements: for near-real-time updates, evaluate Hudi or Delta Lake streaming features.

🧱 Architecture / Workflow Diagram

flowchart LR
A[Raw Landing Zone] --> B[Validation & Quality Checks]
B --> C{Branch}
C -->|Experiment| D[Sandbox Snapshot]
C -->|Production| E[Curated Gold Tables]
D --> F[Model Training]
E --> G[Serving Layer]
F --> H[Lineage Catalog]
G --> H

Diagram Walkthrough

  • Raw Landing Zone: receives data dumps, streaming events, or third-party feeds. Access is tightly controlled to prevent accidental use of unvalidated data.
  • Validation & Quality Checks: apply Great Expectations or dbt tests to enforce schema, nullability, and business rules.
  • Branch Decision: LakeFS or DVC branches capture isolated copies, so experiments can iterate safely without touching production.
  • Sandbox Snapshot: researchers explore features, create derived datasets, and run modeling notebooks against this branch.
  • Curated Gold Tables: production-ready datasets are optimized, deduplicated, and partitioned for downstream consumption.
  • Model Training: pipelines reference explicit dataset tags to ensure consistent training inputs.
  • Serving Layer: feature stores or data services deliver versioned data to production systems.
  • Lineage Catalog: logs every transformation, dataset, and consumer, enabling audits and impact analysis.
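
As one illustration of the validation step, the sketch below uses Great Expectations' classic pandas-flavored API to encode nullability and range rules; the column names, bounds, and file path are assumptions for this example.

# Sketch: expressing schema and business rules with Great Expectations'
# classic PandasDataset API (columns and ranges are illustrative).
import great_expectations as ge

df = ge.read_csv("data/raw/customers.csv")  # hypothetical extract
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

results = df.validate()
if not results.success:
    raise SystemExit("Validation failed; keep the data in the landing zone.")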

βš™οΈ Example Commands / Steps

# Track and push a dataset snapshot
mkdir -p data/raw && cp source/customers.parquet data/raw/
dvc add data/raw/customers.parquet
git add data/raw/.gitignore data/raw/customers.parquet.dvc
git commit -m "Add raw customers snapshot"
dvc push -r s3-store

# Create a lakeFS experimental branch (lakectl is the lakeFS CLI; the repository name is illustrative)
lakectl branch create lakefs://data-repo/churn-exp --source lakefs://data-repo/prod

# Run quality checks before merging to the prod branch (pass the checkpoint name defined in your project)
great_expectations checkpoint run customers

# Promote curated table to Delta Lake time-travel version
spark-submit jobs/promote_curated.py --version-tag "2025-10-12"
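
For orientation, here is one possible shape of the jobs/promote_curated.py job invoked above; a minimal sketch assuming staging and gold Delta tables at hypothetical S3 paths, recording the tag through Delta's userMetadata write option.

# Sketch of jobs/promote_curated.py (paths are illustrative, not prescribed).
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--version-tag", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("promote-curated").getOrCreate()

staged = spark.read.format("delta").load("s3://company-staging/customers")

# Overwriting a Delta table creates a new, time-travelable version; the tag
# lands in the commit history via the userMetadata option.
(staged.write.format("delta")
    .mode("overwrite")
    .option("userMetadata", f"version-tag={args.version_tag}")
    .save("s3://company-gold/customers"))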

Configuration Snippet

# data_lineage_policy.yaml
lineage_tracking:
  enabled: true
  tool: openlineage
  destinations:
    - type: datahub
      endpoint: https://datahub.company/api
quality_checks:
  required:
    - name: column_null_check
      threshold: 0
    - name: schema_version_match
      expected_version: 12
storage_layers:
  raw:
    bucket: s3://company-raw
    retention_days: 30
  staging:
    bucket: s3://company-staging
    retention_days: 90
  curated:
    bucket: s3://company-gold
    retention_days: 365
security:
  encryption: SSE-KMS
  iam_roles:
    - name: mlops-training-role
      permissions: ["s3:GetObject", "lakefs:Read"]
    - name: data-steward-role
      permissions: ["lakefs:Write", "catalog:Update"]

πŸ“Š Example Scenario

A fraud detection program stores events in Delta Lake. Analysts create LakeFS branches for experiments, run Great Expectations suites, and confirm lineage in DataHub before training gradient boosting models on fresh snapshots. When a regulator requests evidence, the team retrieves the exact dataset version and quality report that produced the deployed model.

Extended Scenario Narrative

  • Day 0: Data engineers ingest transactional logs via Kafka into a raw S3 bucket. Lambda functions partition data by event date.
  • Day 1: A scheduled Airflow DAG converts raw logs to Delta tables. Great Expectations validates balance, schema, and anomaly thresholds.
  • Day 2: Analysts branch the Delta table in LakeFS to experiment with new fraud features, logging changes in pull requests.
  • Day 3: After validation passes, the branch merges into production. A new Delta snapshot with version tag fraud_v15 is published.
  • Day 4: ML pipelines reference fraud_v15 for training. Metrics improve, and the model is promoted to canary deployment.
  • Day 7: Drift monitoring detects unusual behavior. Analysts quickly retrieve fraud_v14 and compare distributions to isolate suspect data (a comparison sketch follows this list).
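
The Day 7 triage might look like the sketch below, reading two Delta snapshots side by side; the table path, version numbers, and amount column are assumptions (map the fraud_v14/fraud_v15 tags to Delta version numbers via the table history).

# Sketch: comparing a feature's distribution across two Delta snapshots.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drift-triage").getOrCreate()
path = "s3://company-gold/fraud_events"  # hypothetical table location

for label, version in [("fraud_v14", 14), ("fraud_v15", 15)]:
    snapshot = spark.read.format("delta").option("versionAsOf", version).load(path)
    stats = snapshot.agg(
        F.mean("amount").alias("mean"),
        F.stddev("amount").alias("std"),
    ).first()
    print(label, stats["mean"], stats["std"])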

Additional Use Cases

  • Personalized marketing: segment customers using versioned behavioral datasets to ensure consistent campaign experiments.
  • Supply chain forecasting: maintain historical demand snapshots to test models against past seasons.
  • Healthcare analytics: track provenance of clinical datasets to satisfy HIPAA and FDA regulations.
  • Autonomous vehicles: manage sensor data versions to reproduce simulations and safety tests.
  • Financial crime investigation: preserve evidence-grade datasets for legal proceedings and model validation.

πŸ’‘ Best Practices

  • βœ… Layer environments Separate raw, staging, and production buckets.
  • Apply stricter access policies as data moves toward curated zones.
  • Document data lifecycle transitions with automated metadata updates.
  • βœ… Automate validation Embed quality suites in CI on every data pull request.
  • Fail builds when rule violations occur, preventing regressions from reaching production.
  • Publish quality scorecards so stakeholders understand dataset health trends.
  • βœ… Align tags Sync dataset tags with model registry versions.
  • For each registered model, record dataset hash, feature store snapshot, and labeling version.
  • βœ… Centralize metadata Use catalogs for owners, SLAs, and sensitivity labels.
  • Provide APIs and UI to search by domain, responsible person, or regulatory classification.
  • βœ… Version transformations Store dbt or Spark transformation code with semantic versioning.
  • Tie transformation commits to dataset releases for traceability.
  • βœ… Monitor storage costs Review bucket usage reports and archive low-frequency data to cold storage tiers.
  • βœ… Simulate disaster recovery Drill recovery scenarios to ensure dataset snapshots can be restored quickly.

⚠️ Common Pitfalls

  • 🚫 Storing binaries in Git: repository bloat slows collaboration.
      • Delegate large files to object storage and track pointers with tools like DVC.
  • 🚫 Manual copies: ad-hoc exports lose lineage and permissions.
      • Enforce data access through orchestrated pipelines and controlled branching.
  • 🚫 Ignoring retention: accumulated stale versions spike storage costs.
      • Implement lifecycle policies and regularly purge or archive outdated snapshots.
  • 🚫 Untracked schema changes: hidden column updates break downstream services.
      • Require schema change proposals with automated compatibility checks (see the sketch after this list).
  • 🚫 Missing access controls: broad permissions risk data leaks and compliance violations.
      • Integrate IAM, auditing, and approval workflows for sensitive datasets.
  • 🚫 Lack of documentation: users cannot discover or trust datasets without context.
      • Make metadata and data dictionaries mandatory for every curated table.
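
A compatibility check does not require heavy tooling to start; the sketch below flags removed or retyped columns between two schema snapshots, represented here as plain dicts for illustration:

# Sketch: minimal backward-compatibility check between schema snapshots.
# In practice these would come from a schema registry or table metadata.
old_schema = {"customer_id": "string", "amount": "double"}
new_schema = {"customer_id": "string", "amount": "double", "channel": "string"}

removed = set(old_schema) - set(new_schema)
retyped = {col for col in old_schema
           if col in new_schema and old_schema[col] != new_schema[col]}

if removed or retyped:
    raise SystemExit(f"Breaking change: removed={removed}, retyped={retyped}")
print("Additive-only change: backward compatible.")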

🧩 Related Topics

  • Previous topic: 01_MLOps_Fundamentals.md
      • Builds the foundational mindset for systemizing ML processes.
  • Next topic: 03_Experiment_Tracking.md
      • Shows how dataset versions connect with experiment runs and model registries.

🧭 Quick Recap

Step                 Purpose
Version datasets     Reproduce experiments reliably.
Validate quality     Guard training from corrupt inputs.
Catalog lineage      Answer compliance and discovery questions quickly.
Govern access        Protect sensitive data and enforce least privilege.
Optimize retention   Balance legal requirements with storage efficiency.

How to Apply

  • Use the recap checklist during sprint reviews to confirm data governance coverage.
  • Embed the steps into onboarding manuals for new data engineers and scientists.
  • Map each step to tooling (e.g., DVC, Great Expectations, DataHub) for quick reference.

πŸ–ΌοΈ Assets

  • Diagram: the architecture workflow shown above.
  • Storyboard idea: depict branching workflows where analysts test changes without impacting production.
  • Infographic tip: highlight the flow of data lineage from raw ingestion to model consumption.

πŸ“˜ References

  • Official docs: https://dvc.org/doc
  • Official docs: https://lakefs.io/docs/
  • Official docs: https://docs.delta.io/latest/index.html
  • Blog posts: https://lakefs.io/blog/data-versioning-patterns
  • Blog posts: https://databricks.com/blog/category/delta-lake
  • Blog posts: https://great-expectations.com/blog
  • GitHub examples: https://github.com/delta-io/delta
  • GitHub examples: https://github.com/iterative/dvc
  • GitHub examples: https://github.com/openlineage/openlineage