Course 449:
Spark for Big Data Processing

(3 days)


Course Description

Apache Spark builds on the success of Apache Hadoop in two ways. First, Spark can execute MapReduce applications up to one to two orders of magnitude faster than Hadoop MapReduce, largely by keeping intermediate data in memory. Second, Spark supports not only batch-oriented MapReduce but also MapReduce over streaming data. Spark also directly supports graph processing and machine learning. Spark applications can be developed incrementally using several interactive command-line shells, including Scala, Python, R, and SQL.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the Spark architecture, including cluster management and the file system
  • Explain the components of a Spark application
  • Implement a Spark application based on Resilient Distributed Datasets (RDDs)
  • Interact with Spark using Jupyter notebooks
  • Motivate the use of SQL as an API for MapReduce applications
  • Create and manipulate a relational table using Spark SQL and DataFrames
  • Perform MapReduce with Spark Streaming
  • Implement distributed machine learning applications with Spark ML


Prerequisites

Participants should have completed ROI’s “Big Data: Understanding Hadoop and Its Ecosystem” course or have equivalent experience. Spark functionality is illustrated with Python, Scala, Java, and SQL, so participants should be comfortable reading these languages. In the exercises, participants will adapt code shown in the slides to solve the problems presented.

Who Should Attend

This course is intended for anyone wanting to understand how to implement MapReduce applications with Spark. This is a hands-on course. The exercises are intended to give the participants first-hand experience with developing Spark applications.

Course Outline

Chapter 1: Spark Introduction

  • Spark Architecture and RDD Basics
  • Spark Shell
  • Exercise: Starting the Spark Shell and Working with the Word Count Example
  • RDD Lineage and Partition Basics
  • Exercise: Working with Scala IDE and Running an Application via spark-submit in Batch Mode

Chapter 2: RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in MapReduce
  • Exercise: Working with RDDs
  • Exercise: Applying the Concepts with Airline POC Data (or some other case study)
  • What Notebooks Are and How to Set Up Jupyter Notebook for Python

Chapter 3: Spark SQL

  • Why Spark SQL
  • What and Why of DataFrames
  • Exercise: Creating a Table via sqlContext.sql and Verifying It in HDFS
  • Creating DataFrames with Scala Examples
  • Creating DataFrames with Python Examples
  • Working with JSON Files with Scala
  • Working with Customized Databricks CSV Library

Chapter 4: Working with Spark

  • Exercise: Starting PySpark and Working with the Line Count Example
  • Exercise: Writing a Python Script File and Submitting It via spark-submit
  • Differences Between Scala and Python
  • Exercise: Working with Eclipse and Writing Word Count in Java

Chapter 5: DataFrames

  • RDD vs. DataFrames
  • User-Defined Functions (UDFs) in Spark
  • Working with 30 DataFrame Operations Using Sample Data
  • Checkpointing and Persist Operations
  • Creating UDFs and Using Them in DataFrames and via sqlContext.sql

Chapter 6: Spark Streaming

  • Need for Streaming and Streaming Architecture
  • Lambda Architecture
  • Spark Streaming Using PySpark
  • Spark Streaming Using Scala IDE and Executing via spark-submit

Chapter 7: Spark MLlib (Machine Learning)

  • MLlib
  • Exercise: Hello MLlib

Please Contact Your ROI Representative to Discuss Course Tailoring!