Databricks Performance Optimization
Contact us to book this course
Data Engineering
On-Site, Virtual
1 day
In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and how to analyze the Spark UI to assess performance and debug applications. We’ll cover topics such as streaming, liquid clustering, data skipping, caching, Photon, and more.
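As a taste of the Delta Lake layout techniques covered (liquid clustering and data skipping), here is a minimal sketch; the table and column names are hypothetical:

```sql
-- Liquid clustering: co-locate rows by commonly filtered columns
-- (table and column names are illustrative)
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
) CLUSTER BY (event_date, user_id);

-- Incrementally re-cluster data files; the per-file statistics Delta
-- maintains enable data skipping on selective queries over these columns
OPTIMIZE events;
```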
Objectives
- Describe strategies and best practices for optimizing workloads on Databricks
- Analyze information presented in the Spark UI, Ganglia UI, and Cluster UI to assess performance and debug applications
Prerequisites
This content was developed for participants with the following skills, knowledge, and abilities:
- Ability to perform basic code development tasks using Databricks (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.)
- Intermediate programming experience with PySpark:
  - Extract data from a variety of file formats and data sources
  - Apply a number of common transformations to clean data
  - Reshape and manipulate complex data using advanced built-in functions
- Intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.)
Course outline
- Data Skipping
- Skew
- Shuffle
- Spill
- Join Optimization Lab
- Serialization
- User-Defined Functions
- Fine-Tuning: Choosing the Right Cluster
- Pick the Best Instance Types
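Several of the outline topics (skew, shuffle, spill) intersect with Spark's Adaptive Query Execution. As a hedged sketch, these are the session settings involved; they are enabled by default on recent Databricks runtimes, and the statements below are illustrative rather than recommended overrides:

```sql
-- Adaptive Query Execution: re-optimize plans using runtime shuffle statistics
SET spark.sql.adaptive.enabled = true;

-- Split oversized partitions at join time to mitigate data skew
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Coalesce small shuffle partitions to reduce task overhead and spill
SET spark.sql.adaptive.coalescePartitions.enabled = true;
```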