Module 3: Data Engineering & Distributed Systems
Data Engineering Fundamentals 📊
Session 1: This session introduces core data engineering concepts. We'll discuss the differences between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) and explore where each pattern fits in modern data pipelines.
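To make the contrast concrete, here is a minimal PySpark sketch of the two patterns. The file path and table names (raw.orders, analytics.orders) are hypothetical, and the CREATE OR REPLACE TABLE statement assumes a Databricks/Delta Lake environment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

# ETL: transform in flight, then load only the cleaned result.
orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)
cleaned = (orders
           .dropna(subset=["order_id"])
           .withColumn("amount", F.col("amount").cast("double")))
cleaned.write.mode("overwrite").saveAsTable("analytics.orders")

# ELT: load the raw data as-is first, then transform it later with SQL
# inside the warehouse/lakehouse.
orders.write.mode("overwrite").saveAsTable("raw.orders")
spark.sql("""
    CREATE OR REPLACE TABLE analytics.orders AS
    SELECT order_id, CAST(amount AS DOUBLE) AS amount
    FROM raw.orders
    WHERE order_id IS NOT NULL
""")
```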
Session 2: We'll introduce the Data Medallion Architecture, a foundational pattern for building reliable data lakes. We'll explain the purpose of each layer (Bronze, Silver, Gold) and discuss how data flows through them.
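As a preview of how the layers fit together, here is a minimal PySpark sketch of data moving Bronze → Silver → Gold. The source path, table names, and schemas (bronze, silver, gold) are hypothetical, and writing Delta tables with saveAsTable assumes a Databricks-style metastore.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: raw data ingested as-is, with only ingestion metadata added.
bronze = (spark.read.json("/landing/events/")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.events")

# Silver: cleaned, de-duplicated records with basic quality rules applied.
silver = (spark.table("bronze.events")
          .dropDuplicates(["event_id"])
          .filter(F.col("event_type").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.events")

# Gold: aggregated, business-ready metrics.
gold = (spark.table("silver.events")
        .groupBy("event_type")
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.event_counts")
```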
Databricks & PySpark 📈
Session 3: We'll introduce Databricks as a platform for large-scale data processing. We'll explore the basics of Apache Spark and how it distributes work across a cluster.
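The following sketch shows both ideas in miniature: the SparkSession as the entry point, and a DataFrame whose rows are split into partitions that executors process in parallel. The app name is arbitrary and the partition count depends on the cluster configuration.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession, the entry point to the cluster.
spark = SparkSession.builder.appName("spark-basics").getOrCreate()

# Spark splits this DataFrame into partitions that are processed
# in parallel by the cluster's executors.
df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())  # how many partitions the work is split into

# Transformations are lazy; this action triggers a distributed job.
total = df.selectExpr("sum(id) AS total").collect()[0]["total"]
print(total)
```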
Session 4: This is a hands-on session with PySpark. Students will learn to work with Spark DataFrames and perform common data manipulation operations such as filter, select, and groupBy in a distributed environment.
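A minimal example of those operations on a small in-memory DataFrame; the column names and sample rows are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# A small in-memory DataFrame standing in for a real dataset.
sales = spark.createDataFrame(
    [("US", "laptop", 1200.0), ("US", "phone", 800.0), ("DE", "laptop", 1100.0)],
    ["country", "product", "amount"],
)

result = (sales
          .filter(F.col("amount") > 900)   # keep rows matching a condition
          .select("country", "amount")     # project only the needed columns
          .groupBy("country")              # group and aggregate across partitions
          .agg(F.sum("amount").alias("total_amount")))

result.show()
```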
Spark SQL and Data Pipelines
Session 5: This session covers Spark SQL basics. Students will write queries in Databricks to join tables, perform aggregations, and transform data.
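A self-contained sketch of the kind of query involved, using temporary views in place of real catalog tables; the customers/orders data is invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-basics").getOrCreate()

# Register two small DataFrames as temporary views so they can be queried with SQL.
spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
).createOrReplaceTempView("customers")

spark.createDataFrame(
    [(1, 120.0), (1, 80.0), (2, 45.0)], ["customer_id", "amount"]
).createOrReplaceTempView("orders")

# Join the views and aggregate order totals per customer.
spark.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
""").show()
```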
Session 6: In this session, students will build a simple data pipeline, applying their knowledge to move and transform data between the medallion layers within Databricks using either PySpark or Spark SQL.
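One possible shape for such a pipeline, expressed here purely in Spark SQL. The bronze.orders source table and the silver/gold table names are hypothetical, and CREATE OR REPLACE TABLE assumes the Databricks/Delta Lake defaults.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("medallion-pipeline").getOrCreate()

# Bronze -> Silver: clean, type, and de-duplicate the raw records.
spark.sql("""
    CREATE OR REPLACE TABLE silver.orders AS
    SELECT DISTINCT order_id, customer_id,
           CAST(amount AS DOUBLE) AS amount, order_date
    FROM bronze.orders
    WHERE order_id IS NOT NULL
""")

# Silver -> Gold: aggregate into a business-ready summary table.
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM silver.orders
    GROUP BY order_date
""")
```

Because each step overwrites its target table, the pipeline can be re-run end to end without manual cleanup.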