
Databricks Performance Optimization

Contact us to book this course
Learning Track

Data Engineering

Delivery methods

On-Site, Virtual

Duration

1 day

In this course, you’ll learn how to optimize workloads and physical data layout with Spark and Delta Lake, and how to analyze the Spark UI to assess performance and debug applications. We’ll cover topics such as streaming, liquid clustering, data skipping, caching, Photon, and more.

Objectives

By the end of this course, you should be able to:
  • Describe strategies and best practices for optimizing workloads on Databricks

  • Analyze information presented in the Spark UI, Ganglia UI, and Cluster UI to assess performance and debug applications

Prerequisites

This course was developed for participants with the following skills, knowledge, and abilities:

  • Ability to perform basic code development tasks using Databricks (create clusters, run code in notebooks, use basic notebook operations, import repos from Git, etc.)

  • Intermediate programming experience with PySpark

    • Extract data from a variety of file formats and data sources

    • Apply a number of common transformations to clean data

    • Reshape and manipulate complex data using advanced built-in functions

  • Intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.)

Course outline

  • Data Skipping
  • Skew
  • Shuffle
  • Spill
  • Join Optimization Lab
  • Serialization
  • User-Defined Functions
  • Fine-Tuning: Choosing the Right Cluster
  • Pick the Best Instance Types

Ready to accelerate your team's innovation?