This post covers the core concepts, tools, and services essential for data engineering and cloud-based workloads on AWS. It provides an end-to-end perspective, from data fundamentals through advanced analytics and governance to certification preparation.
1. Data Engineering Fundamentals
- Types of Data: Structured, semi-structured, and unstructured, with examples and characteristics.
- 3Vs of Big Data: Volume, Velocity, Variety.
- Data Warehouses vs. Data Lakes: Schema-on-write vs. schema-on-read, ETL vs. ELT, and the rise of the Data Lakehouse.
- Data Mesh: Decentralized ownership and domain-driven data products.
- ETL Pipelines: Extract, Transform, Load workflows with AWS Glue, EventBridge, MWAA, Step Functions, and Lambda.
- Data Sources & Formats: JDBC, ODBC, APIs, streams; CSV, JSON, Avro, Parquet.
- Data Modeling: Star schema, fact tables, dimensions, and ERDs.
- Data Lineage: Tracking data flow for compliance and debugging.
- Schema Evolution: Adapting schemas with AWS Glue Schema Registry.
- Database Optimization: Indexing, partitioning, and compression.
- Data Sampling: Random and stratified sampling.
- Data Skew Solutions: Adaptive partitioning, salting, repartitioning.
- Data Validation & Profiling: Ensuring completeness, consistency, accuracy, and integrity.
- SQL Review: Aggregations, grouping, joins, pivoting, regex.
- Git Review: Core commands for collaboration and recovery.
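As an illustration of the salting technique listed under data skew solutions, here is a minimal sketch in plain Python. The partition count, number of salt values, key names, and the use of CRC32 as the partitioning hash are all assumptions made for the example; in a real Spark job the same idea would be applied to the join or group-by key.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count, for illustration only
SALT_BUCKETS = 8    # assumed number of salt values per key


def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_key(key: str, rng: random.Random) -> str:
    """Append a random salt so one hot key spreads over several partitions."""
    return f"{key}#{rng.randrange(SALT_BUCKETS)}"


rng = random.Random(42)

# A skewed dataset: one hot key dominates, plus 1,000 ordinary keys.
keys = ["customer_1"] * 9000 + [f"customer_{i}" for i in range(2, 1002)]

unsalted = Counter(partition_for(k) for k in keys)
salted = Counter(partition_for(salted_key(k, rng)) for k in keys)

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:   ", max(salted.values()))
```

Without the salt, every row for the hot key lands in one partition. The trade-off is that any aggregation on the salted key needs a second pass that strips the salt and combines the partial results.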
2. Storage
- Amazon S3: Buckets, objects, storage classes, versioning, replication, encryption, performance tuning, S3 Select, Object Lambda.
- Amazon EBS: Persistent block storage for EC2, elastic resizing, AZ-bound.
- Amazon EFS: Scalable NFS for EC2 with high availability.
- AWS Backup: Centralized backups with point-in-time recovery (PITR) and Vault Lock.
3. Databases
- DynamoDB: Serverless NoSQL with LSIs, GSIs, DAX, TTL, Streams, and global tables.
- Amazon RDS & Aurora: Managed relational databases with ACID, scaling, replicas, and optimization.
- Amazon DocumentDB: MongoDB-compatible.
- Amazon MemoryDB for Redis: Durable, in-memory key-value store.
- Amazon Keyspaces: Cassandra-compatible, serverless.
- Amazon Neptune: Managed graph database.
- Amazon Timestream: Time series database for IoT/metrics.
- Amazon Redshift: Petabyte-scale data warehouse with Spectrum, RA3, serverless option, ML integration, and advanced performance features.
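One detail of DynamoDB TTL that is easy to get wrong: the TTL attribute must be a Number holding Unix epoch time in seconds. The sketch below computes such a value; the item shape and the attribute names `session_id` and `expires_at` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch_seconds(now: datetime, keep_for: timedelta) -> int:
    """Return the expiry time as Unix epoch seconds, the format DynamoDB TTL expects."""
    return int((now + keep_for).timestamp())


now = datetime.now(timezone.utc)
item = {
    "session_id": "abc-123",  # hypothetical partition key
    "expires_at": ttl_epoch_seconds(now, timedelta(days=7)),  # hypothetical TTL attribute
}
print(item)
```

Expired items are removed in the background rather than at the exact expiry time, so reads that must never see expired data should still filter on the TTL attribute.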
4. Migration & Transfer
- Application Discovery Service: Assess on-prem workloads.
- Application Migration Service (MGN): Lift-and-shift.
- Database Migration Service (DMS): Homogeneous & heterogeneous migrations.
- Schema Conversion Tool (SCT): Schema transformations.
- AWS DataSync: Automated large-scale data transfer.
- AWS Snow Family: Physical devices for large dataset migrations.
- AWS Transfer Family: FTP/SFTP/FTPS into S3/EFS.
5. Compute
- Amazon EC2: On-Demand, Spot, and Reserved purchasing options; scaling for EMR clusters.
- AWS Graviton: Cost-efficient custom processors.
- AWS Lambda: Serverless compute for real-time ETL, triggers, and integrations.
- AWS Batch: Docker-based batch jobs, and how it compares with AWS Glue.
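To show what real-time ETL in Lambda can look like, here is a minimal sketch of a handler for a Kinesis Data Streams trigger. Kinesis delivers each record's payload base64-encoded under `record["kinesis"]["data"]`; the JSON field names (`userId`, `amount`) and the transformation itself are hypothetical.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records, skip malformed payloads, and return transformed rows."""
    rows = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            doc = json.loads(payload)
        except ValueError:
            continue  # malformed JSON or bad encoding: skipped in this sketch
        # Hypothetical transform: rename the field and normalize the types.
        rows.append({"user_id": str(doc["userId"]), "amount": float(doc["amount"])})
    return {"processed": len(rows), "rows": rows}


# Local usage with a hand-built event that mimics the Kinesis record shape.
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(b'{"userId": 7, "amount": "12.5"}').decode()}},
        {"kinesis": {"data": base64.b64encode(b"not json").decode()}},
    ]
}
result = handler(sample_event, None)
print(result)
```

A production handler would normally send bad records to a dead-letter destination instead of silently dropping them, and would write the rows to a sink such as S3 or Firehose.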
6. Containers
- Docker: Containers vs. VMs; images stored in Amazon ECR.
- Amazon ECS: EC2 vs. Fargate launch types, IAM roles, load balancing, EFS volumes.
- Amazon ECR: Docker image registry.
- Amazon EKS: Managed Kubernetes with multiple storage options.
7. Analytics
- AWS Glue: Data catalog, crawlers, ETL, Glue Studio, DataBrew, Data Quality, workflows.
- AWS Lake Formation: Secure data lakes with governed tables.
- Amazon Athena: Serverless SQL on S3 with Iceberg support.
- Apache Spark: Batch, streaming, ML, and graph analytics.
- Amazon EMR: Managed Hadoop ecosystem, serverless & EKS options.
- Amazon Kinesis (Data Streams, Data Firehose, Data Analytics): Real-time ingestion, delivery, and querying.
- Amazon MSK: Managed Apache Kafka.
- Amazon OpenSearch Service: Managed search & analytics with dashboards.
- Amazon QuickSight: Cloud-native BI and ML-powered insights.
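Query cost and speed in Athena depend heavily on how the data in S3 is laid out. Athena and Glue crawlers recognize Hive-style `key=value` prefixes as partitions, which lets a query with a matching filter skip unrelated objects. The sketch below builds such an object key; the prefix, partition columns, and file name are hypothetical.

```python
from datetime import date


def partitioned_key(prefix: str, event_date: date, filename: str) -> str:
    """Build an S3 object key with Hive-style year=/month=/day= partitions."""
    return (
        f"{prefix}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}/{filename}"
    )


key = partitioned_key("events", date(2024, 3, 9), "part-0000.parquet")
print(key)
```

With this layout, a query that filters on `year`, `month`, and `day` only reads the objects under the matching prefixes.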
8. Application Integration
- Amazon SQS: Queues (Standard & FIFO) with DLQs.
- Amazon SNS: Pub/Sub messaging, fan-out with SQS.
- AWS Step Functions: Workflow orchestration.
- Amazon AppFlow: SaaS ↔ AWS data movement.
- Amazon EventBridge: Event-driven applications and schema registry.
- Amazon MWAA: Managed Workflows for Apache Airflow.
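For the SQS dead-letter queues mentioned above, the link between a source queue and its DLQ is the `RedrivePolicy` queue attribute, a JSON string containing `deadLetterTargetArn` and `maxReceiveCount`. The sketch below only builds that attribute; the ARN is a made-up placeholder and no AWS call is made.

```python
import json

# Hypothetical ARN; replace with the ARN of a real dead-letter queue.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"


def redrive_policy(dlq_arn: str, max_receive_count: int) -> str:
    """Serialize the RedrivePolicy queue attribute as a JSON string."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


# Attributes in the shape a create-queue or set-queue-attributes call would take.
attributes = {"RedrivePolicy": redrive_policy(DLQ_ARN, 5)}
print(attributes)
```

A message is moved to the DLQ after it has been received `maxReceiveCount` times without being deleted. Note that the dead-letter queue of a FIFO queue must also be a FIFO queue.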
9. Security, Identity & Compliance
- Principle of Least Privilege: Minimize permissions.
- Data Masking & Anonymization: PII handling.
- Key Salting: Adding a random salt before hashing, for example to protect stored passwords.
- Data Residency: Compliance with region policies.
- IAM: Users, groups, roles, and MFA.
- Encryption: In-flight, at-rest, client-side.
- AWS KMS: Key management and rotation.
- Amazon Macie: PII detection in S3.
- Secrets Manager: Secure secrets rotation.
- AWS WAF: Web security against exploits.
- Service-Specific Security: S3, DynamoDB, RDS, Redshift, Lambda, Glue, etc.
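As a concrete illustration of salting for password security, the sketch below hashes a password with a random per-user salt using PBKDF2 from the Python standard library. The iteration count and salt length are assumptions for the example, not recommendations taken from this post.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware
SALT_BYTES = 16       # assumed salt length


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))
print(verify_password("wrong guess", salt, digest))
```

Because each user gets a different salt, two users with the same password end up with different digests, which defeats precomputed lookup tables.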
10. Networking & Content Delivery
- VPC & Subnets: Public/private networking.
- Internet Gateway & NAT Gateways: Connectivity.
- VPC Flow Logs: Network traffic logging.
- VPC Endpoints: Private service access.
- VPN & Direct Connect: On-prem connectivity.
- AWS PrivateLink: Private service sharing.
- Amazon Route 53: Managed DNS.
- Amazon CloudFront: CDN for content delivery & DDoS protection.
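To make the VPC and subnet item more tangible, the sketch below carves a VPC CIDR block into /24 subnets with Python's `ipaddress` module. The CIDR range and the public/private assignment are hypothetical. AWS reserves five addresses in every subnet (the first four and the last), which the calculation accounts for.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one in each subnet

vpc = ipaddress.ip_network("10.0.0.0/16")  # hypothetical VPC CIDR
subnets = list(vpc.subnets(new_prefix=24))

# Hypothetical assignment: first /24 is public, second is private.
public_subnet, private_subnet = subnets[0], subnets[1]
usable_hosts = private_subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print("subnets available:", len(subnets))
print("public:", public_subnet, "private:", private_subnet)
print("usable addresses per /24:", usable_hosts)
```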
11. Management & Governance
- Amazon CloudWatch: Metrics, logs, alarms, dashboards.
- AWS CloudTrail: API auditing with CloudTrail Lake.
- AWS Config: Compliance & remediation.
- AWS CloudFormation: Infrastructure as Code.
- SSM Parameter Store: Secure config management.
- Well-Architected Framework: Six pillars of good architecture.
- Amazon Managed Grafana: Visualization and alerting.
12. Machine Learning
- Amazon SageMaker: Model building, training, deployment.
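- SageMaker Feature Store: Central store for creating and sharing ML features.
- SageMaker ML Lineage Tracking: Records the steps and artifacts of an ML workflow.
- SageMaker Data Wrangler: Visual data preparation for ML.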
13. Developer Tools
- AWS CLI & SDKs: Programmatic AWS access.
- AWS Cloud9: Cloud IDE.
- AWS CDK: IaC with code.
- CodeDeploy, CodeCommit, CodeBuild, CodePipeline: CI/CD workflows.
14. Cost Management
- AWS Budgets: Cost and usage alerts.
- AWS Cost Explorer: Cost analysis & forecasting.
15. APIs & Miscellaneous
- Amazon API Gateway: Serverless APIs with versioning, security, caching, and throttling.
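API Gateway throttling is based on the token bucket algorithm, with a steady-state rate and a burst capacity. The class below is a minimal sketch of that algorithm for intuition only; it is not API Gateway's implementation, and the rate and burst values are hypothetical.

```python
class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a throttled API call would receive 429 Too Many Requests


bucket = TokenBucket(rate=2.0, burst=3)
# Five requests arrive together at t=0, then one more a second later.
results = [bucket.allow(0.0) for _ in range(5)] + [bucket.allow(1.0)]
print(results)
```

The burst capacity absorbs short spikes, while the refill rate caps sustained throughput; clients that receive a throttling response are expected to retry with backoff.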