Skip to content
Go back

02 - AWS Data Engineering Fundamentals

Published:  at  07:00 PM

AWS Data Engineering Fundamentals:

Cue / ConceptNotes
Types of DataStructured: Predefined schema, relational DBs (Postgres, MySQL, Oracle). Examples: banking transactions, CRM records. Easy to query with SQL, supports ACID.

Semi-structured: Self-describing, no strict schema. Examples: JSON logs from APIs, XML configs, Avro files in Kafka. Flexible for evolving requirements but less efficient than structured.

Unstructured: No schema, raw formats. Examples: audio (podcasts), video (YouTube), text (emails), images (X-rays). Requires ML/AI for extracting insights (NLP, CV, speech-to-text).
3Vs of Big DataVolume: Petabytes/Zettabytes of data. Requires distributed storage (AWS S3, HDFS, GCP Cloud Storage). Scaling vertically is expensive, so horizontal scaling is common.

Velocity: Streaming data at high ingestion rates (IoT sensors, stock market ticks, clickstreams). Tools: Apache Kafka, AWS Kinesis, Spark Streaming, Flink.

Variety: Mix of structured (tables), semi-structured (JSON logs), and unstructured (images, PDFs). Challenge: ensuring interoperability and governance. Sometimes extended to 5Vs adding Veracity (data quality) and Value (business impact).
Data Warehouses vs. Data LakesData Warehouse: Optimized for analytics. Schema-on-write. Strong governance. Examples: Amazon Redshift, Snowflake, Google BigQuery. Uses ETL (clean/transform before loading). Best for BI dashboards, historical reporting.

Data Lake: Centralized storage for raw data. Schema-on-read. Stores structured + unstructured data. Examples: AWS S3, Azure Data Lake. Uses ELT (store raw first, transform later). Good for ML, data science exploration.

Data Lakehouse: Combines both. Flexibility of lakes + reliability/performance of warehouses. Supports ACID transactions. Examples: Databricks Lakehouse, Snowflake Unistore, AWS Athena + Iceberg.
Data MeshA decentralized approach: domains (finance, sales, HR) own their data as products.
Key principles:
1) Domain ownership: Each team manages their own pipelines.
2) Data as a product: Treated like APIs (well-documented, discoverable).
3) Self-service platform: Common tools for ingestion, governance, monitoring.
4) Federated governance: Policies and standards across domains.
Benefits: scalability, autonomy. Challenges: standardization, governance.
ETL PipelinesExtract: Pull from DBs (via JDBC/ODBC), APIs, message queues, IoT streams.
Transform: Clean nulls, validate formats, enrich (add geo/IP info), standardize (camelCase vs snake_case).
Load: Store into Redshift, Snowflake, or S3.

AWS services:
- AWS Glue: serverless ETL service.
- EventBridge: event-driven triggers.
- MWAA: Managed Apache Airflow for orchestration.
- Step Functions: state machines for workflows.
- Lambda: lightweight transformation functions.
Data Sources & FormatsSources:
- JDBC/ODBC: relational databases (Postgres, MySQL, SQL Server).
- APIs: REST (JSON/XML), GraphQL.
- Streams: Kafka, Kinesis, Pub/Sub.

Formats:
- CSV: Simple, human-readable, inefficient at scale.
- JSON: Flexible, nested, common for APIs.
- Avro: Row-based, schema registry support, efficient for Kafka.
- Parquet: Columnar, highly compressed, best for analytics.
- ORC: Columnar, optimized for Hive/Spark queries.
Data ModelingStar Schema: Central fact table + multiple dimensions. Simpler queries, good for BI.
Snowflake Schema: Dimensions normalized, reduces redundancy but more complex joins.
Fact Tables: Store numeric metrics (sales amount, impressions).
Dimension Tables: Describe entities (date, product, region).
ERD (Entity Relationship Diagram): Visualizes tables, keys, relationships. Used in logical and physical database design.
Data LineageShows origin, transformations, and destination of data.
Uses:
- Debugging: Trace errors to source.
- Compliance: GDPR “right to know” where data came from.
- Impact analysis: If schema changes, what downstream pipelines break?
Tools: AWS Glue Data Catalog, Collibra, Apache Atlas, Amundsen.
Schema EvolutionAdapting schemas as data changes.
Examples: adding a column (age), removing fields, type changes (intstring).
Forward compatibility: Old readers can handle new schemas.
Backward compatibility: New readers can handle old data.
Tools: AWS Glue Schema Registry, Confluent Schema Registry (Avro/Protobuf/JSON).
Database Optimization- Indexing: Improves lookup speed but costs storage. Types: B-Tree, Bitmap, Hash.
- Partitioning: Horizontal split of data by key (date, region). Improves query pruning.
- Sharding: Distribute data across multiple nodes.
- Compression: Reduces storage, speeds up IO. Formats: Snappy, ZSTD, GZIP.
- Caching: Use Redis/Memcached for hot queries.
Data Sampling- Random Sampling: Selects records randomly, useful for unbiased testing.
- Stratified Sampling: Ensures all groups represented (e.g., gender, region). Better for skewed distributions.
- Systematic Sampling: Every nth record. Efficient but may introduce bias if data ordered.
Used in ML training, A/B testing, data validation.
Data Skew Solutions- Adaptive Partitioning: Auto-adjusts partitions based on data load.
- Salting: Add random numbers to skewed keys to distribute evenly.
- Repartitioning: Increase number of partitions (Spark repartition()).
- Broadcast join: Send small table to all workers instead of shuffling big one.
Data Validation & Profiling- Completeness: All expected data is present.
- Consistency: Values don’t conflict across systems.
- Accuracy: Data reflects real-world truth.
- Integrity: Referential integrity between tables.
Profiling checks distributions, outliers, min/max values, null counts. Tools: Great Expectations, Deequ, dbt tests.
SQL Review- Aggregations: SUM, AVG, MIN, MAX, COUNT.
- Grouping: GROUP BY + HAVING for filtering aggregates.
- Joins: INNER (common), LEFT/RIGHT (with nulls), FULL (union).
- Pivot/Unpivot: Convert rows to columns (and vice versa).
- Regex: Pattern extraction/validation (REGEXP_LIKE, REGEXP_EXTRACT).

Suggest Changes

Previous Post
03 - Unlocking AWS Storage - The Complete 2024 Guide
Next Post
01 - Overview AWS for Data Engineering