Module 3: Data Engineering & Distributed Systems
Data Engineering Fundamentals 📊
Session 1: This session introduces core data engineering concepts. We'll discuss the differences between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) and explore where each pattern fits in modern data pipelines.
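To make the contrast concrete, here is a minimal PySpark sketch of the two patterns. The file path and table names (raw.orders, analytics.orders) are hypothetical, and the CREATE OR REPLACE TABLE statement assumes a Databricks/Delta Lake environment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

# ETL: transform in flight, then load only the cleaned result.
orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)
cleaned = (orders
           .dropna(subset=["order_id"])
           .withColumn("amount", F.col("amount").cast("double")))
cleaned.write.mode("overwrite").saveAsTable("analytics.orders")

# ELT: load the raw data as-is first, then transform it later with SQL
# inside the warehouse/lakehouse.
orders.write.mode("overwrite").saveAsTable("raw.orders")
spark.sql("""
    CREATE OR REPLACE TABLE analytics.orders AS
    SELECT order_id, CAST(amount AS DOUBLE) AS amount
    FROM raw.orders
    WHERE order_id IS NOT NULL
""")
```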
Session 2: We'll introduce the Data Medallion Architecture, a foundational pattern for building reliable data lakes. We'll explain the purpose of each layer (Bronze, Silver, Gold) and discuss how data flows through them.
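As a preview of how the layers fit together, here is a minimal PySpark sketch of data moving Bronze → Silver → Gold. The source path, table names, and schemas (bronze, silver, gold) are hypothetical, and writing Delta tables with saveAsTable assumes a Databricks-style metastore.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: raw data ingested as-is, with only ingestion metadata added.
bronze = (spark.read.json("/landing/events/")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.events")

# Silver: cleaned, de-duplicated records with basic quality rules applied.
silver = (spark.table("bronze.events")
          .dropDuplicates(["event_id"])
          .filter(F.col("event_type").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.events")

# Gold: aggregated, business-ready metrics.
gold = (spark.table("silver.events")
        .groupBy("event_type")
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.event_counts")
```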
Databricks & PySpark 📈
Session 3: We'll introduce Databricks as a platform for large-scale data processing. We'll explore the basics of Apache Spark and how it distributes work across a cluster.
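The following sketch shows both ideas in miniature: the SparkSession as the entry point, and a DataFrame whose rows are split into partitions that executors process in parallel. The app name is arbitrary and the partition count depends on the cluster configuration.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession, the entry point to the cluster.
spark = SparkSession.builder.appName("spark-basics").getOrCreate()

# Spark splits this DataFrame into partitions that are processed
# in parallel by the cluster's executors.
df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())  # how many partitions the work is split into

# Transformations are lazy; this action triggers a distributed job.
total = df.selectExpr("sum(id) AS total").collect()[0]["total"]
print(total)
```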
Session 4: This is a hands-on session with PySpark. Students will learn to work with Spark DataFrames and perform common data manipulation operations such as filter, select, and groupBy in a distributed environment.
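A minimal example of those operations on a small in-memory DataFrame; the column names and sample rows are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# A small in-memory DataFrame standing in for a real dataset.
sales = spark.createDataFrame(
    [("US", "laptop", 1200.0), ("US", "phone", 800.0), ("DE", "laptop", 1100.0)],
    ["country", "product", "amount"],
)

result = (sales
          .filter(F.col("amount") > 900)   # keep rows matching a condition
          .select("country", "amount")     # project only the needed columns
          .groupBy("country")              # group and aggregate across partitions
          .agg(F.sum("amount").alias("total_amount")))

result.show()
```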
Spark SQL and Data Pipelines
Session 5: This session covers Spark SQL basics. Students will write queries in Databricks to join tables, perform aggregations, and transform data.
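A self-contained sketch of the kind of query involved, using temporary views in place of real catalog tables; the customers/orders data is invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-basics").getOrCreate()

# Register two small DataFrames as temporary views so they can be queried with SQL.
spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
).createOrReplaceTempView("customers")

spark.createDataFrame(
    [(1, 120.0), (1, 80.0), (2, 45.0)], ["customer_id", "amount"]
).createOrReplaceTempView("orders")

# Join the views and aggregate order totals per customer.
spark.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
""").show()
```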
Session 6: In this session, students will build a simple data pipeline, applying their knowledge to move and transform data between the medallion layers within Databricks using either PySpark or Spark SQL.
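One possible shape for such a pipeline, expressed here purely in Spark SQL. The bronze.orders source table and the silver/gold table names are hypothetical, and CREATE OR REPLACE TABLE assumes the Databricks/Delta Lake defaults.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("medallion-pipeline").getOrCreate()

# Bronze -> Silver: clean, type, and de-duplicate the raw records.
spark.sql("""
    CREATE OR REPLACE TABLE silver.orders AS
    SELECT DISTINCT order_id, customer_id,
           CAST(amount AS DOUBLE) AS amount, order_date
    FROM bronze.orders
    WHERE order_id IS NOT NULL
""")

# Silver -> Gold: aggregate into a business-ready summary table.
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM silver.orders
    GROUP BY order_date
""")
```

Because each step overwrites its target table, the pipeline can be re-run end to end without manual cleanup.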