Advanced Data Engineering with Databricks

Course ID: DTB-ADED
Duration: 2 Days
Private in-house training

In addition to our public, instructor-led classes, we also offer private in-house training tailored to an organization's needs. Call us at +852 2116 3328 or email us at [email protected] for more details.

What skills are covered
  • Databricks Streaming and Lakeflow Spark Declarative Pipelines
  • Databricks Data Privacy
  • Databricks Performance Optimization
  • Automated Deployment with Databricks Asset Bundles
Who should attend this course
  • Everyone who is interested
Course Modules

Module 1: Databricks Streaming and Lakeflow Spark Declarative Pipelines

  • Streaming Data Concepts
  • Introduction to Structured Streaming
  • Demo: Reading from a Streaming Query
  • Streaming from Delta Lake
  • Streaming Query Lab
  • Aggregation, Time Windows, Watermarks
  • Event Time + Aggregations over Time Windows
  • Trigger Types and Output Modes
  • Stream Aggregation Lab
  • Demo: Windowed Aggregation with Watermark
  • Stream Joins (Optional)
  • Demo: Stream Joins (Optional)
  • Data Ingestion Pattern
  • Demo: Auto Load to Bronze
  • Demo: Stream from Multiplex Bronze
  • Data Quality Enforcement
  • Demo: Data Quality Enforcement
  • Streaming ETL Lab
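
For a flavor of the windowed-aggregation and watermark topics listed above, the following PySpark sketch reads a Delta table as a stream, counts events per 10-minute window while tolerating late-arriving data, and appends the results to another Delta table. Table names, column names, and the checkpoint path are illustrative placeholders rather than course assets, and spark is the SparkSession provided by a Databricks notebook.

    from pyspark.sql import functions as F

    # Read a Delta table as a streaming source (placeholder table name).
    events = spark.readStream.table("bronze.events")

    # Tolerate events arriving up to 15 minutes late, then count per 10-minute window.
    windowed_counts = (
        events
        .withWatermark("event_time", "15 minutes")
        .groupBy(F.window("event_time", "10 minutes"), "event_type")
        .count()
    )

    # Append completed windows to a downstream Delta table; the checkpoint tracks
    # progress so the query can resume incrementally on the next run.
    query = (
        windowed_counts.writeStream
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoints/windowed_event_counts")
        .trigger(availableNow=True)
        .toTable("silver.windowed_event_counts")
    )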


Module 2: Databricks Data Privacy

  • Regulatory Compliance
  • Data Privacy
  • Key Concepts and Components
  • Audit Your Data
  • Data Isolation
  • Demo: Securing Data in Unity Catalog
  • Pseudonymization & Anonymization
  • Summary & Best Practices
  • Demo: PII Data Security
  • Capturing Changed Data
  • Deleting Data in Databricks
  • Demo: Processing Records from CDF and Propagating Changes
  • Lab: Propagating Changes with CDF
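
As a rough illustration of the pseudonymization topic listed above, this PySpark sketch replaces a raw email column with a salted SHA-256 digest so records remain joinable without exposing the identifier. The table, column, and secret-scope names are placeholders, not course assets, and the pattern shown is one common approach rather than the course's reference solution.

    from pyspark.sql import functions as F

    # The salt is read from a Databricks secret scope (placeholder scope/key names).
    salt = dbutils.secrets.get(scope="pii", key="hash_salt")

    users = spark.read.table("bronze.users")

    pseudonymized = (
        users
        # Replace the raw identifier with a salted hash; the same input always maps to
        # the same digest, so joins and aggregations on the pseudonym still work.
        .withColumn("email_pseudonym", F.sha2(F.concat(F.col("email"), F.lit(salt)), 256))
        .drop("email")
    )

    pseudonymized.write.mode("overwrite").saveAsTable("silver.users_pseudonymized")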


Module 3: Databricks Performance Optimization

  • Spark UI Introduction
  • Introduction to Designing the Foundation
  • Demo: File Explosion
  • Data Skipping and Liquid Clustering
  • Lab: Data Skipping and Liquid Clustering
  • Skew
  • Shuffles
  • Demo: Shuffle
  • Spill
  • Lab: Exploding Join
  • Serialization
  • Demo: User-Defined Functions
  • Fine-Tuning: Choosing the Right Cluster
  • Pick the Best Instance Types
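
To ground the data-skipping and Liquid Clustering topics above, the sketch below creates a Delta table clustered on the columns most often used in filters and then runs OPTIMIZE, which incrementally clusters newly written data. Table and column names are placeholders, and exact syntax support depends on the Databricks Runtime in use.

    # Liquid Clustering replaces static partitioning/ZORDER with clustering keys
    # that can be changed later without rewriting the whole table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS silver.transactions (
            transaction_id STRING,
            customer_id    STRING,
            event_date     DATE,
            amount         DOUBLE
        )
        CLUSTER BY (customer_id, event_date)
    """)

    # OPTIMIZE clusters recently ingested data; no explicit ZORDER columns are needed.
    spark.sql("OPTIMIZE silver.transactions")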


Module 4: Automated Deployment with Databricks Asset Bundles

  • DevOps Review
  • Continuous Integration and Continuous Deployment/Delivery (CI/CD) Review
  • Demo: Course Setup and Authentication
  • Deploying Databricks Projects
  • Introduction to Databricks Asset Bundles (DABs)
  • Demo: Deploying a Simple DAB
  • Lab: Deploying a Simple DAB
  • Variable Substitutions in DABs
  • Demo: Deploying a DAB to Multiple Environments
  • Lab: Deploy a DAB to Multiple Environments
  • DAB Project Templates Overview
  • Lab: Use a Databricks Default DAB Template
  • CI/CD Project Overview with DABs
  • Demo: Continuous Integration and Continuous Deployment with DABs
  • Lab: Adding ML to Engineering Workflows with DABs
  • Developing Locally with Visual Studio Code (VSCode)
  • Demo: Using VSCode with Databricks
  • CI/CD Best Practices for Data Engineering
  • Next Steps: Automated Deployment with GitHub Actions
Prerequisites
  • Ability to perform basic code development tasks using the Databricks Data Engineering and Data Science workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.)
  • Intermediate programming experience with PySpark:
      • Extract data from a variety of file formats and data sources
      • Apply a number of common transformations to clean data
      • Reshape and manipulate complex data using advanced built-in functions
  • Intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.)
  • Beginner experience configuring and scheduling data pipelines using the Lakeflow Spark Declarative Pipelines UI
  • Beginner experience defining Lakeflow Spark Declarative Pipelines using PySpark:
      • Ingest and process data using Auto Loader and PySpark syntax
      • Process Change Data Capture feeds with APPLY CHANGES INTO syntax
      • Review pipeline event logs and results to troubleshoot Declarative Pipeline syntax

  • Strong knowledge of the Databricks platform, including experience with Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture, Unity Catalog, Lakeflow Declarative Pipelines, and Workflows. In particular, knowledge of leveraging Expectations with Lakeflow Declarative Pipelines.
  • Experience in data ingestion and transformation, with proficiency in PySpark for data processing and DataFrame manipulation. Candidates should also have experience writing intermediate-level SQL queries for data analysis and transformation.
  • Proficiency in Python programming, including the ability to design and implement functions and classes, and experience with creating, importing, and utilizing Python packages.
  • Familiarity with DevOps practices, particularly continuous integration and continuous delivery/deployment (CI/CD) principles.
  • A basic understanding of Git version control.

  • Prerequisite course: DevOps Essentials for Data Engineering
