AWS Data Engineering Fundamentals:
| Cue / Concept | Notes |
|---|---|
| Types of Data | Structured: Predefined schema, relational DBs (Postgres, MySQL, Oracle). Examples: banking transactions, CRM records. Easy to query with SQL, supports ACID. Semi-structured: Self-describing, no strict schema. Examples: JSON logs from APIs, XML configs, Avro files in Kafka. Flexible for evolving requirements, but less storage- and query-efficient than structured data. Unstructured: No schema, raw formats. Examples: audio (podcasts), video (YouTube), text (emails), images (X-rays). Requires ML/AI for extracting insights (NLP, CV, speech-to-text). |
| 3Vs of Big Data | Volume: Petabytes/Zettabytes of data. Requires distributed storage (AWS S3, HDFS, GCP Cloud Storage). Scaling vertically is expensive, so horizontal scaling is common. Velocity: Streaming data at high ingestion rates (IoT sensors, stock market ticks, clickstreams). Tools: Apache Kafka, AWS Kinesis, Spark Streaming, Flink. Variety: Mix of structured (tables), semi-structured (JSON logs), and unstructured (images, PDFs). Challenge: ensuring interoperability and governance. Sometimes extended to 5Vs adding Veracity (data quality) and Value (business impact). |
| Data Warehouses vs. Data Lakes | Data Warehouse: Optimized for analytics. Schema-on-write. Strong governance. Examples: Amazon Redshift, Snowflake, Google BigQuery. Uses ETL (clean/transform before loading). Best for BI dashboards, historical reporting. Data Lake: Centralized storage for raw data. Schema-on-read. Stores structured + unstructured data. Examples: AWS S3, Azure Data Lake. Uses ELT (store raw first, transform later). Good for ML, data science exploration. Data Lakehouse: Combines both. Flexibility of lakes + reliability/performance of warehouses. Supports ACID transactions. Examples: Databricks Lakehouse, Snowflake Unistore, AWS Athena + Iceberg. |
| Data Mesh | A decentralized approach: domains (finance, sales, HR) own their data as products. Key principles: 1) Domain ownership: Each team manages their own pipelines. 2) Data as a product: Treated like APIs (well-documented, discoverable). 3) Self-service platform: Common tools for ingestion, governance, monitoring. 4) Federated governance: Policies and standards across domains. Benefits: scalability, autonomy. Challenges: standardization, governance. |
| ETL Pipelines | Extract: Pull from DBs (via JDBC/ODBC), APIs, message queues, IoT streams. Transform: Clean nulls, validate formats, enrich (add geo/IP info), standardize naming (camelCase vs. snake_case). Load: Store into Redshift, Snowflake, or S3. AWS services: - AWS Glue: serverless ETL service. - EventBridge: event-driven triggers. - MWAA: Managed Apache Airflow for orchestration. - Step Functions: state machines for workflows. - Lambda: lightweight transformation functions. See the ETL sketch after this table. |
| Data Sources & Formats | Sources: - JDBC/ODBC: relational databases (Postgres, MySQL, SQL Server). - APIs: REST (JSON/XML), GraphQL. - Streams: Kafka, Kinesis, Pub/Sub. Formats: - CSV: Simple, human-readable, inefficient at scale. - JSON: Flexible, nested, common for APIs. - Avro: Row-based, schema registry support, efficient for Kafka. - Parquet: Columnar, highly compressed, best for analytics. - ORC: Columnar, optimized for Hive/Spark queries. See the file-format sketch after this table. |
| Data Modeling | Star Schema: Central fact table + multiple dimensions. Simpler queries, good for BI. Snowflake Schema: Dimensions normalized, reduces redundancy but more complex joins. Fact Tables: Store numeric metrics (sales amount, impressions). Dimension Tables: Describe entities (date, product, region). ERD (Entity Relationship Diagram): Visualizes tables, keys, relationships. Used in logical and physical database design. See the star-schema sketch after this table. |
| Data Lineage | Shows origin, transformations, and destination of data. Uses: - Debugging: Trace errors to source. - Compliance: regulations such as GDPR require showing where personal data came from and how it was processed. - Impact analysis: If a schema changes, what downstream pipelines break? Tools: AWS Glue Data Catalog, Collibra, Apache Atlas, Amundsen. |
| Schema Evolution | Adapting schemas as data changes. Examples: adding a column (e.g., age), removing fields, type changes (int → string). Forward compatibility: Old readers can handle data written with newer schemas. Backward compatibility: New readers can handle data written with older schemas. Tools: AWS Glue Schema Registry, Confluent Schema Registry (Avro/Protobuf/JSON). See the compatibility sketch after this table. |
| Database Optimization | - Indexing: Improves lookup speed but costs storage. Types: B-Tree, Bitmap, Hash. - Partitioning: Horizontal split of data by key (date, region). Improves query pruning. - Sharding: Distribute data across multiple nodes. - Compression: Reduces storage, speeds up I/O. Formats: Snappy, ZSTD, GZIP. - Caching: Use Redis/Memcached for hot queries. See the partitioning/compression sketch after this table. |
| Data Sampling | - Random Sampling: Selects records randomly, useful for unbiased testing. - Stratified Sampling: Ensures all groups are represented (e.g., gender, region). Better for skewed distributions. - Systematic Sampling: Every nth record. Efficient, but may introduce bias if the data is ordered. Used in ML training, A/B testing, data validation. See the sampling sketch after this table. |
| Data Skew Solutions | - Adaptive Partitioning: Auto-adjusts partitions based on data load. - Salting: Add random numbers to skewed keys to distribute them evenly. - Repartitioning: Increase the number of partitions (Spark repartition()). - Broadcast join: Send the small table to all workers instead of shuffling the big one. See the salting/broadcast sketch after this table. |
| Data Validation & Profiling | - Completeness: All expected data is present. - Consistency: Values don’t conflict across systems. - Accuracy: Data reflects real-world truth. - Integrity: Referential integrity between tables. Profiling checks distributions, outliers, min/max values, null counts. Tools: Great Expectations, Deequ, dbt tests. See the validation sketch after this table. |
| SQL Review | - Aggregations: SUM, AVG, MIN, MAX, COUNT. - Grouping: GROUP BY + HAVING for filtering aggregates. - Joins: INNER (matching rows only), LEFT/RIGHT (keep one side, NULLs where unmatched), FULL (all rows from both sides). - Pivot/Unpivot: Convert rows to columns (and vice versa). - Regex: Pattern extraction/validation (REGEXP_LIKE, REGEXP_EXTRACT). See the SQL sketch after this table. |
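
ETL Pipelines sketch: a minimal Lambda-style extract-transform-load in Python. The API URL, bucket name, and field names are placeholders invented for illustration; a real pipeline would add retries, pagination, and error handling.

```python
import csv
import io

import boto3      # AWS SDK for Python
import requests   # simple HTTP client for the extract step

s3 = boto3.client("s3")

def handler(event, context):
    """Extract JSON from a (hypothetical) API, transform it, load CSV to S3."""
    # Extract: pull raw records from an API endpoint (placeholder URL).
    records = requests.get("https://api.example.com/orders", timeout=30).json()

    # Transform: drop rows with null IDs and standardize key names to snake_case.
    cleaned = [
        {"order_id": r["orderId"], "amount": float(r["amount"])}
        for r in records
        if r.get("orderId") is not None
    ]

    # Load: write the cleaned rows as CSV into an S3 "processed" prefix.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(cleaned)
    s3.put_object(
        Bucket="my-data-lake",               # placeholder bucket
        Key="processed/orders.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
    return {"rows_loaded": len(cleaned)}
```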
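
Data Sources & Formats sketch: writing the same small frame to CSV, JSON Lines, and Parquet with pandas (assumes the pyarrow engine is installed); file names and columns are arbitrary.

```python
import pandas as pd  # assumes pandas with pyarrow installed for Parquet support

df = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "country": ["US", "DE", "IN"],
        "revenue": [12.5, 7.0, 3.2],
    }
)

# Row-oriented, human-readable, no schema or compression.
df.to_csv("events.csv", index=False)

# Flexible and nested; one JSON object per line, common for API and log data.
df.to_json("events.json", orient="records", lines=True)

# Columnar and compressed; typically the best choice for analytics at scale.
df.to_parquet("events.parquet", compression="snappy")

# Reading Parquet back only scans the requested columns.
print(pd.read_parquet("events.parquet", columns=["country", "revenue"]))
```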
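
Data Modeling sketch: a toy star schema in pandas, with one fact table and two hypothetical dimension tables, joined and aggregated the way a typical BI query would be.

```python
import pandas as pd

# Dimension tables describe entities (hypothetical product and date dimensions).
dim_product = pd.DataFrame(
    {"product_id": [1, 2], "product_name": ["Widget", "Gadget"], "category": ["A", "B"]}
)
dim_date = pd.DataFrame(
    {"date_id": [20240101, 20240102], "month": ["2024-01", "2024-01"]}
)

# The fact table stores numeric metrics keyed to the dimensions.
fact_sales = pd.DataFrame(
    {
        "date_id": [20240101, 20240101, 20240102],
        "product_id": [1, 2, 1],
        "sales_amount": [100.0, 250.0, 75.0],
    }
)

# A typical star-schema query: join facts to dimensions, then aggregate.
report = (
    fact_sales.merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
    .groupby(["month", "category"], as_index=False)["sales_amount"]
    .sum()
)
print(report)
```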
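
Schema Evolution sketch: a registry-free illustration of backward compatibility, where a new reader supplies a default for a field that older records never had. Field names and defaults are invented for the example.

```python
# Backward compatibility in miniature: a *new* reader consumes records written
# under an *old* schema by supplying defaults for fields added later.

OLD_RECORDS = [{"id": 1, "name": "alice"}]           # written before "age" existed
NEW_RECORDS = [{"id": 2, "name": "bob", "age": 30}]  # written with the new schema

NEW_SCHEMA_DEFAULTS = {"age": None}  # default lets the new reader handle old data

def read_with_new_schema(record: dict) -> dict:
    """Apply defaults so old and new records share one shape downstream."""
    return {**NEW_SCHEMA_DEFAULTS, **record}

for rec in OLD_RECORDS + NEW_RECORDS:
    print(read_with_new_schema(rec))
```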
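
Database Optimization sketch: writing Hive-style partitioned, ZSTD-compressed Parquet with pandas/pyarrow so query engines can prune partitions. Paths and column names are placeholders, and a reasonably recent pandas/pyarrow is assumed.

```python
import pandas as pd  # assumes the pyarrow engine is installed

df = pd.DataFrame(
    {
        "sale_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "region": ["us-east", "eu-west", "us-east"],
        "amount": [100.0, 80.0, 120.0],
    }
)

# Partition by a pruning key and compress with ZSTD: engines that understand
# Hive-style partitioning (Athena, Spark, etc.) can skip whole directories
# when a query filters on region.
df.to_parquet(
    "sales_partitioned",          # output directory, e.g. region=us-east/...
    partition_cols=["region"],
    compression="zstd",
)

# Reading back with a partition filter only touches the matching files.
us_east = pd.read_parquet("sales_partitioned", filters=[("region", "=", "us-east")])
print(us_east)
```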
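
Data Sampling sketch: random, stratified, and systematic sampling of a synthetic frame with pandas; the column names and 10% sampling rate are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": range(1, 101),
        "region": ["NA", "EU", "APAC", "EU"] * 25,
    }
)

# Random sampling: 10% of rows, seeded for reproducibility.
random_sample = df.sample(frac=0.10, random_state=42)

# Stratified sampling: 10% from every region, so small groups stay represented.
stratified_sample = df.groupby("region").sample(frac=0.10, random_state=42)

# Systematic sampling: every 10th record (beware of bias if the data is ordered).
systematic_sample = df.iloc[::10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```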
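
Data Skew sketch: salting a hot join key and, as an alternative, broadcasting the small side in PySpark. The key values and salt-bucket count are made up, and a running Spark environment is assumed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

# Hypothetical skewed fact table: almost every row shares the customer_id "hot".
big = spark.createDataFrame(
    [("hot", i) for i in range(10_000)] + [("cold", 1)],
    ["customer_id", "amount"],
)
small = spark.createDataFrame(
    [("hot", "US"), ("cold", "DE")], ["customer_id", "country"]
)

SALT_BUCKETS = 8

# Salting: spread the hot key across SALT_BUCKETS values on the big side...
big_salted = big.withColumn("salt", (F.rand(seed=42) * SALT_BUCKETS).cast("int"))

# ...and replicate each small-side row once per salt value so the join still matches.
small_salted = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = big_salted.join(small_salted, on=["customer_id", "salt"]).drop("salt")

# Alternative for genuinely small tables: a broadcast join avoids the shuffle entirely.
broadcast_joined = big.join(F.broadcast(small), on="customer_id")

print(joined.count(), broadcast_joined.count())
```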
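
Data Validation & Profiling sketch: plain-pandas stand-ins for the completeness, uniqueness, range, and referential-integrity checks that tools like Great Expectations or Deequ formalize; the tables are synthetic.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 4],
        "customer_id": [10, 11, 11, 99],
        "amount": [50.0, None, 20.0, -5.0],
    }
)
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Profiling: null counts, duplicates, and basic distribution stats.
print(orders.isna().sum())                    # completeness
print(orders["order_id"].duplicated().sum())  # uniqueness
print(orders["amount"].describe())            # min/max and outlier hints

# Validation checks (would normally feed a report or fail the pipeline).
checks = {
    "no_null_amounts": orders["amount"].notna().all(),
    "amounts_non_negative": (orders["amount"].dropna() >= 0).all(),
    "order_id_unique": orders["order_id"].is_unique,
    # Referential integrity: every order's customer exists in the customers table.
    "customers_exist": orders["customer_id"].isin(customers["customer_id"]).all(),
}
print(checks)
```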
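
SQL Review sketch: GROUP BY/HAVING plus a LEFT JOIN run against an in-memory SQLite database via Python's standard sqlite3 module. Tables and values are invented, and regex functions vary by engine so they are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
    CREATE TABLE regions (region TEXT, manager TEXT);
    INSERT INTO orders VALUES (1, 'NA', 100), (2, 'NA', 250), (3, 'EU', 80);
    INSERT INTO regions VALUES ('NA', 'alice'), ('EU', 'bob'), ('APAC', 'carol');
    """
)

# Aggregate per region, keep only regions whose total sales exceed 100.
query = """
    SELECT r.region,
           r.manager,
           COUNT(o.order_id) AS order_count,
           SUM(o.amount)     AS total_amount
    FROM regions r
    LEFT JOIN orders o ON o.region = r.region    -- LEFT JOIN: APAC joins with NULLs
    GROUP BY r.region, r.manager
    HAVING COALESCE(SUM(o.amount), 0) > 100      -- HAVING filters on the aggregate
"""
for row in conn.execute(query):
    print(row)
```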