Databricks Performance Optimization
Contact us to book this course
Data Engineering
On-Site, Virtual
1 day
In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and how to analyze the Spark UI to assess performance and debug applications. We’ll cover topics such as streaming, liquid clustering, data skipping, caching, Photon, and more.
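As a taste of the Delta Lake layout techniques covered (liquid clustering and data skipping), here is a minimal sketch; the table and column names are hypothetical:

```sql
-- Liquid clustering: co-locate rows by commonly filtered columns
-- (table and column names are illustrative)
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
) CLUSTER BY (event_date, user_id);

-- Incrementally re-cluster data files; the per-file statistics Delta
-- maintains enable data skipping on selective queries over these columns
OPTIMIZE events;
```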
Objectives
- Describe strategies and best practices for optimizing workloads on Databricks
- Analyze information presented in the Spark UI, Ganglia UI, and Cluster UI to assess performance and debug applications
Prerequisites
This content was developed for participants with the following skills, knowledge, and abilities:
- Ability to perform basic code development tasks using Databricks (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.)
- Intermediate programming experience with PySpark:
  - Extract data from a variety of file formats and data sources
  - Apply a number of common transformations to clean data
  - Reshape and manipulate complex data using advanced built-in functions
- Intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.)
Course outline
- Data Skipping
- Skew
- Shuffle
- Spill
- Join Optimization Lab
- Serialization
- User-Defined Functions
- Fine-Tuning: Choosing the Right Cluster
- Pick the Best Instance Types
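Several of the outline topics (skew, shuffle, spill) intersect with Spark's Adaptive Query Execution. As a hedged sketch, these are the session settings involved; they are enabled by default on recent Databricks runtimes, and the statements below are illustrative rather than recommended overrides:

```sql
-- Adaptive Query Execution: re-optimize plans using runtime shuffle statistics
SET spark.sql.adaptive.enabled = true;

-- Split oversized partitions at join time to mitigate data skew
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Coalesce small shuffle partitions to reduce task overhead and spill
SET spark.sql.adaptive.coalescePartitions.enabled = true;
```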