🧠 Data Engineering and Big Data
🎯 What You'll Learn
- Data journey: Follow raw data from source to clean dataset.
- Batch and streaming: Know when to choose each style.
- Processing engines: Work with Spark, Hadoop, Dask, and Beam.
- Storage choices: Pick the right format and layer for every workload.
- Quality and governance: Keep data trusted, secure, and compliant.
- Team habits: Share clear runbooks, alerts, and documentation.
📘 Overview
Good data engineering keeps machine learning teams productive. This guide explains how to collect, clean, store, and deliver data in a reliable way. The language is plain. The examples stay close to daily work. You can skim any section and find quick reminders or checklists to apply in your projects.
🧭 Data Flow Map
- Source systems: Applications, sensors, CRM tools, third-party APIs.
- Ingestion: Batch pulls, CDC streams, message queues.
- Landing zone: Raw files kept in low-cost storage for replay.
- Processing: Transformations, joins, quality checks.
- Serving layer: Datasets ready for analysts, ML models, or APIs.
- Monitoring: Metrics, logs, and alerts to catch breaks fast.
📥 Ingestion Patterns
- Batch pulls: Run on a schedule, often daily or hourly.
- Change data capture (CDC): Capture inserts, updates, and deletes in near real time.
- Streaming: Continuous flow through Kafka, Kinesis, or Pub/Sub.
- API polling: Call partner APIs with rate limits and retries (see the sketch after this list).
- File drops: Watch SFTP folders or shared buckets for new files.
- Event trackers: Use SDKs to collect product events at scale.
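As a concrete illustration of the API polling pattern, here is a minimal sketch; the endpoint URL, the `PARTNER_API_KEY` environment variable, and the backoff values are illustrative assumptions, not a real partner API.
```python
import os
import time

import requests


def poll_api(url: str, max_retries: int = 3, backoff_seconds: float = 2.0) -> dict:
    """Call a partner API once, retrying rate-limit responses with exponential backoff."""
    headers = {"Authorization": f"Bearer {os.environ['PARTNER_API_KEY']}"}  # hypothetical secret
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 429:  # rate limited: wait, then try again
            time.sleep(backoff_seconds * (2 ** attempt))
            continue
        response.raise_for_status()      # surface other HTTP errors immediately
        return response.json()
    raise RuntimeError(f"gave up after {max_retries} attempts: {url}")


# Hypothetical endpoint: pull orders changed since the last run's watermark.
orders = poll_api("https://api.example.com/v1/orders?updated_since=2025-01-01")
```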
🛠️ Processing Engines
| Engine | Best For |
|---|---|
| Apache Spark | Large-scale batch and streaming with rich APIs. |
| Hadoop MapReduce | Classic batch jobs on HDFS. |
| Dask | Pythonic parallel computing on clusters. |
| Apache Beam | Unified batch + streaming pipelines. |
| Flink | Low-latency streaming with complex state needs. |
| Snowflake Tasks | Managed SQL pipelines with auto scaling. |
Engine Selection Tips
- Start simple with SQL-based tools when datasets are small.
- Use Spark or Beam when data volume or velocity grows.
- Combine with orchestration tools like Airflow, Prefect, or Dagster.
- Test locally with sample data before pushing to clusters, as in the sketch below.
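That last tip can be as small as a local Spark session run against a sample extract before any cluster submission; the sample path below is an assumption.
```python
from pyspark.sql import SparkSession

# Local smoke test: exercise the read path and schema on a small sample,
# no cluster required. "data/samples/sales/" is an assumed local extract.
spark = (
    SparkSession.builder
    .master("local[2]")          # two local cores
    .appName("local_smoke_test")
    .getOrCreate()
)

sample = spark.read.parquet("data/samples/sales/")
sample.printSchema()
print("sample rows:", sample.count())

spark.stop()
```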
📦 Storage Layers
- Raw zone: Immutable copies in formats like JSON or CSV.
- Clean zone: Standardized tables with soft schema enforcement.
- Curated zone: Data marts or feature tables ready for ML or BI tools.
- Lakehouse: Combine structured tables and unstructured objects.
- Data warehouse: Managed systems like BigQuery, Snowflake, Redshift.
- Feature store: Serve online and offline features with consistency.
File Formats
- CSV: Human readable, but no schema or compression.
- Parquet: Columnar, compressed, and fast for analytics (see the snippet after this list).
- Avro: Row-based with schema support and backward compatibility.
- ORC: Columnar format tuned for Hadoop ecosystems.
- Delta / Iceberg / Hudi: Table formats that add ACID transactions on data lakes.
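To see the trade-offs in practice, a quick comparison with pandas (which writes Parquet through pyarrow, assuming it is installed) might look like this; the columns and values are made up.
```python
import pandas as pd

# Toy table; real datasets are wider and much longer.
df = pd.DataFrame({
    "order_id": range(1_000),
    "store": ["north", "south"] * 500,
    "amount": [19.99] * 1_000,
})

# CSV: human readable, but no schema and no compression by default.
df.to_csv("orders.csv", index=False)

# Parquet: columnar, compressed, and type-preserving.
df.to_parquet("orders.parquet", compression="snappy")

# Column pruning: read only the columns a query needs.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
```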
🔍 Data Quality Checks
- Schema validation: Confirm column names, types, and null rules.
- Range checks: Ensure values stay within accepted limits.
- Uniqueness: Watch for duplicate IDs or events.
- Freshness: Monitor arrival times to catch stale feeds.
- Anomaly detection: Compare to historical averages to flag spikes.
- Business rules: Custom checks like "total orders equals sum of sub orders". A hand-rolled sketch of several of these checks follows this list.
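Several of these checks can be hand-rolled before reaching for a dedicated tool. The sketch below assumes a pandas DataFrame with hypothetical `order_id`, `amount`, and `loaded_at` columns.
```python
import pandas as pd


def run_quality_checks(df: pd.DataFrame, max_age_hours: int = 24) -> list:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    # Schema validation: required columns must exist.
    required = {"order_id", "amount", "loaded_at"}
    missing = required - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # later checks depend on these columns

    # Range check: amounts must be positive.
    if (df["amount"] <= 0).any():
        failures.append("non-positive amounts found")

    # Uniqueness: order_id must not repeat.
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")

    # Freshness: the newest row must be recent enough.
    age = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["loaded_at"], utc=True).max()
    if age > pd.Timedelta(hours=max_age_hours):
        failures.append(f"stale data: newest row is {age} old")

    return failures
```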
Tools for Quality
- Great Expectations: Declarative tests with data docs.
- Soda: Low-code quality tests and monitoring.
- Deequ: Library from Amazon for Spark-based checks.
- dbt tests: Simple checks inside analytics pipelines.
🔒 Governance and Security
- Access control: Grant least privilege with IAM roles and groups.
- Data catalogs: Use DataHub, Amundsen, or AWS Glue Data Catalog for discovery.
- Lineage tracking: Record upstream and downstream dependencies.
- PII handling: Mask or tokenize sensitive fields (see the masking sketch after this list).
- Compliance: Document GDPR, HIPAA, or SOC 2 requirements.
- Audit logs: Keep records of who touched which dataset.
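For the PII item, one lightweight approach is to hash or tokenize sensitive fields before they leave the restricted zone; this is a sketch with an assumed salt variable, not a full tokenization service.
```python
import hashlib
import os

import pandas as pd

# Salt stored in a secret manager or environment variable (assumed name).
SALT = os.environ.get("PII_HASH_SALT", "change-me")


def mask_email(email: str) -> str:
    """Replace an email with a stable, non-reversible token usable for joins."""
    digest = hashlib.sha256((SALT + email.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:16]}"


customers = pd.DataFrame({"email": ["a@example.com", "b@example.com"]})
customers["email_token"] = customers["email"].map(mask_email)
customers = customers.drop(columns=["email"])  # raw PII never reaches the curated zone
```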
⚖️ ETL vs ELT
- ETL (Extract, Transform, Load): Transform data before loading it into the warehouse.
- ELT (Extract, Load, Transform): Load raw data first, then transform inside the warehouse.
- Decision tips:
  - Choose ETL when compliance requires cleaning data before storage.
  - Choose ELT when using cloud warehouses with strong compute features.
  - Hybrid patterns also exist; pick what fits your toolchain.
🌊 Streaming Concepts
- Event time vs processing time: Align windows using watermarks.
- Windows: Tumbling, sliding, and session windows for aggregations (see the streaming sketch after this list).
- Stateful processing: Manage rolling counts or averages over streams.
- Exactly-once delivery: Use idempotent writes or transactional sinks.
- Backpressure: Handle slow consumers with scaling or buffering.
- Dead letter queues: Capture bad events for later review.
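To make windows and watermarks concrete, here is a PySpark Structured Streaming sketch; the broker address, topic name, schema, and output paths are placeholders, and running it needs the Spark Kafka connector on the classpath.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "orders")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Tumbling 5-minute windows on event time, tolerating 10 minutes of late data.
revenue = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"))
    .sum("amount")
)

query = (
    revenue.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://bucket/curated/revenue/")                 # placeholder sink
    .option("checkpointLocation", "s3://bucket/checkpoints/revenue/")
    .start()
)
```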
☁️ Cloud Data Services
- AWS Glue: Serverless ETL with Spark under the hood.
- Amazon Athena: Run SQL on S3 data with pay-per-query pricing.
- Apache Iceberg on AWS: Manage ACID tables on data lakes through the Glue Data Catalog.
- Amazon Kinesis: Ingestion streams for real-time analytics.
- Azure Data Factory: Orchestrate data flows and manage pipelines.
- Azure Synapse: Combined SQL, Spark, and pipeline workspace.
- Google BigQuery: Fully managed warehouse with automatic scaling.
- Google Dataflow: Managed Apache Beam runner for batch and streaming.
🧰 Example Airflow DAG
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_raw():
    """Pull raw source data into the landing zone."""
    pass


def build_features():
    """Turn raw data into model-ready feature tables."""
    pass


with DAG(
    dag_id="churn_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 6 * * *",  # daily at 06:00
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_raw", python_callable=load_raw)
    features = PythonOperator(task_id="build_features", python_callable=build_features)

    load >> features  # build features only after the raw load succeeds
```
🛠️ Example Spark Job
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sales_clean").getOrCreate()

# Read raw sales, drop rows with non-positive amounts, and publish the clean copy.
data = spark.read.parquet("s3://bucket/raw/sales/")
clean = data.filter(col("amount") > 0)
clean.write.mode("overwrite").parquet("s3://bucket/clean/sales/")
```
📋 Orchestration Runbook
- Define SLAs: Note expected runtime and data availability time.
- Alert rules: Fail fast on missing inputs or long runtimes.
- Retry strategy: Use exponential backoff for transient issues (see the helper after this list).
- Rollback: Keep the last good dataset and quick restore steps.
- Handoffs: Document who gets paged and how to escalate.
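The retry item is often implemented as a small helper around flaky calls; the function below is a generic sketch with illustrative delay values (orchestrators such as Airflow also expose task-level retries and retry_delay settings).
```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=5.0, max_delay=300.0):
    """Call func(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # in practice, catch only transient error types
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay / 2)  # jitter avoids thundering herds
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Example: wrap a flaky extract step (hypothetical function).
# retry_with_backoff(lambda: load_raw())
```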
🧪 Testing Pipelines
- Unit tests: Mock file reads and check function outputs (see the example after this list).
- Integration tests: Run pipelines on sample data in CI.
- Data tests: Validate counts and key metrics after jobs finish.
- Performance tests: Check runtime with large sample data.
- Canary runs: Execute new jobs on a subset before full rollout.
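A unit test for a transformation might look like the sketch below; `clean_sales` is a hypothetical pure function that drops non-positive amounts, tested with pandas and pytest instead of a live cluster.
```python
import pandas as pd


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: keep only positive amounts."""
    return df[df["amount"] > 0].reset_index(drop=True)


def test_clean_sales_drops_non_positive_amounts():
    raw = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, -5.0, 0.0]})

    result = clean_sales(raw)

    assert list(result["order_id"]) == [1]
    assert (result["amount"] > 0).all()
```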
📈 Observability
- Metrics: Record row counts, latency, and error rates (see the logging sketch after this list).
- Logs: Include dataset names, partition info, and notebook references.
- Traces: Follow data through multiple services with OpenTelemetry.
- Dashboards: Display pipeline health in Grafana, Looker, or Data Studio.
- Alerts: Page the on-call when freshness or quality checks fail.
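A minimal version of the metrics and logs items is one structured record per job run; the field names below are an assumed convention, and real setups usually forward the same numbers to Prometheus, CloudWatch, or a warehouse table.
```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")


def log_run_metrics(dataset: str, partition: str, row_count: int, started_at: float) -> None:
    """Emit one structured log line per run for dashboards and alerting."""
    logger.info(json.dumps({
        "dataset": dataset,
        "partition": partition,
        "row_count": row_count,
        "duration_seconds": round(time.time() - started_at, 1),
    }))


start = time.time()
# ... run the job, then record what happened (values here are illustrative):
log_run_metrics("clean.sales", "2025-01-01", row_count=125_000, started_at=start)
```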
🧮 Cost Management
- Storage lifecycle: Move cold data to cheaper tiers (Glacier, Archive, Nearline).
- Partition strategy: Partition by date or region to reduce query scans (see the snippet after this list).
- Compression: Store data compressed and use column pruning.
- Job scheduling: Run heavy jobs off-peak to get lower rates.
- Cluster sizing: Match executors to the workload; auto-stop idle clusters.
- Cost tagging: Label jobs and resources with team or project names.
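The partition strategy item translates directly into a partitioned write; the paths and column names in this Spark sketch are placeholders.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partitioned_write").getOrCreate()

events = spark.read.parquet("s3://bucket/clean/events/")

# Partition by date so downstream queries scan only the days they need.
(
    events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://bucket/curated/events/")
)

# Readers that filter on the partition column skip whole directories.
one_day = (
    spark.read.parquet("s3://bucket/curated/events/")
    .filter(col("event_date") == "2025-01-01")
)
```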
📝 Documentation Habits
- Data contracts: Record expectations for schema, delivery frequency, and owners (see the sketch after this list).
- Playbooks: Share "how to rerun" steps for each pipeline.
- Glossary: Define business terms to avoid confusion.
- Release notes: Summarize changes to data models or transformations.
- FAQ pages: Answer common team questions about data access.
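Data contracts are often captured as code or config that producers and consumers can check in CI; the dataclass below is one lightweight, hypothetical way to hold the same fields a contract document would.
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataContract:
    """Minimal record of what a dataset promises its consumers."""
    dataset: str
    owner: str
    delivery_schedule: str                         # e.g. a cron expression
    columns: dict = field(default_factory=dict)    # column name -> type
    freshness_sla_hours: int = 24


sales_contract = DataContract(
    dataset="clean.sales",
    owner="data-platform@company.example",
    delivery_schedule="0 6 * * *",
    columns={"order_id": "bigint", "store": "string", "amount": "double"},
    freshness_sla_hours=24,
)
```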
🧑‍🤝‍🧑 Collaboration Tips
- Slack alerts: Inform consumers about delays or schema changes early.
- Office hours: Hold weekly drop-ins for stakeholders.
- Pair sessions: Review tricky SQL or Spark code together.
- Training: Offer quick sessions on new tools or pipelines.
- Feedback loops: Collect requests and pain points in a shared tracker.
🔄 Data Lifecycle Checklist
- Plan: Confirm business questions, KPIs, and compliance needs.
- Build: Create ingestion scripts, transformations, and tests.
- Deploy: Ship pipelines with CI/CD and version tags.
- Operate: Monitor metrics, fix alerts, optimize jobs.
- Improve: Gather feedback, add new fields, retire unused data.
📊 Example Scenario
A retail company builds a daily demand forecast. Kafka streams point-of-sale events to a raw S3 bucket. An hourly Spark job cleans and aggregates sales by store and category. Great Expectations tests guard row counts and numeric ranges. Curated results land in a Delta Lake table, which dbt models expose to the forecasting team. Prefect orchestrates the flow and sends Slack alerts. When the team finds a spike in faulty data, they replay from the raw zone and document the cause in the data catalog.
💡 Best Practices
- ✅ Start small: Build thin slices of the pipeline and expand carefully.
- ✅ Automate quality: Run tests on every batch before publishing data.
- ✅ Track lineage: Know who depends on each dataset.
- ✅ Keep raw data: Retain originals for audits and reprocessing.
- ✅ Align with ML: Share feature availability and freshness schedules.
- ✅ Review costs: Evaluate storage and compute bills monthly.
⚠️ Common Pitfalls
- 🚫 Hidden manual steps: Automate ad-hoc scripts before they become critical.
- 🚫 Unannounced schema changes: Communicate with downstream users.
- 🚫 Overloaded clusters: Right-size infrastructure regularly.
- 🚫 Ignored streaming lag: Monitor end-to-end latency.
- 🚫 Weak security: Rotate credentials and limit access scopes.
🧩 Related Topics
- Previous Topic: 11_Core_Programming_and_Foundations.md
- Next Topic: 13_Distributed_Training_and_Performance.md
🧭 Quick Recap
| Step | Purpose |
|---|---|
| Ingest reliably | Bring data in without loss. |
| Process with care | Clean, join, and enrich data step by step. |
| Store smartly | Pick formats and layers that balance cost and speed. |
| Test and monitor | Catch issues before users do. |
| Communicate | Keep teams aligned on changes and health. |
🖼️ Assets
- Diagram: Pipeline from source systems through raw, clean, and curated layers.
- Checklist: Daily stand-up template for pipeline status.
📚 References
- Apache Spark documentation: https://spark.apache.org/docs/latest/
- Apache Beam documentation: https://beam.apache.org/documentation/
- Delta Lake: https://delta.io/
- Databricks blog: https://databricks.com/blog
- Google Cloud data analytics blog: https://cloud.google.com/blog/topics/data-analytics
- Airflow example DAGs: https://github.com/apache/airflow/tree/main/airflow/example_dags
- Great Expectations: https://github.com/great-expectations/great_expectations