Course 449:
Spark for Big Data Processing

(3 days)

 

Course Description

Apache Spark builds on the success of Apache Hadoop in two ways. First, Spark executes MapReduce applications one to two orders of magnitude faster than Hadoop. Second, Spark supports not only batch-oriented MapReduce but also MapReduce over streaming data. In addition, Spark directly supports graph processing and machine learning. Spark applications can be developed incrementally using Spark's interactive command-line shells for Scala, Python, R, and SQL.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the Spark architecture including cluster management and file system
  • Explain the components of a Spark application
  • Implement a Spark application based on Resilient Distributed Datasets (RDDs)
  • Interact with Spark using Jupyter notebooks
  • Explain the motivation for using SQL as an API for MapReduce applications
  • Create and manipulate a relational table using Spark SQL and DataFrames
  • Perform MapReduce with Spark Streaming
  • Implement distributed machine learning applications with Spark ML

Prerequisites

Participants should have completed ROI’s “Big Data: Understanding Hadoop and Its Ecosystem” course or have equivalent experience. Spark functionality is illustrated with Python, Scala, Java, and SQL. Participants should be comfortable reading these languages. In addition, participants will be asked to adapt code shown in the slides to solve problems presented in the exercises.

Who Should Attend

This course is intended for anyone who wants to learn how to implement MapReduce applications with Spark. It is a hands-on course; the exercises give participants first-hand experience developing Spark applications.


Course Outline

Chapter 1: Spark Introduction

  • Spark Architecture and RDD Basics
  • Spark Shell
  • Exercise: Starting the Spark Shell and Working with the Word Count Example (sketched after this list)
  • RDD Lineage and Partitions Basics
  • Exercise: Working with the Scala IDE and Running an Application via spark-submit in Batch Mode
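
For illustration, the word-count exercise might look like the following in the PySpark shell (the chapter also uses the Scala spark-shell); the input path is a placeholder, not course material:

    # Inside the PySpark shell, `sc` (the SparkContext) is predefined.
    # The HDFS path below is a placeholder.
    lines = sc.textFile("hdfs:///data/sample.txt")

    # Transformations: split lines into words, pair each word with 1,
    # and sum the counts per word.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # take() is an action: it triggers execution of the lineage above.
    for word, count in counts.take(10):
        print(word, count)

The same pipeline runs unchanged whether typed interactively or packaged in a script for spark-submit.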

Chapter 2: RDDs: Resilient Distributed Datasets

  • Coding with RDDs
  • Transformations
  • Actions
  • Lazy Evaluation and Optimization
  • RDDs in MapReduce
  • Exercise: Working with RDDs (see the sketch after this list)
  • Exercise: Applying the Concepts with Airline POC Data (or some other case study)
  • Introduction to Notebooks and Setting Up Jupyter Notebook for Python
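
A minimal sketch of transformations, actions, and lazy evaluation in the PySpark shell; the data values are illustrative:

    # `sc` is the SparkContext predefined by the PySpark shell.
    nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

    # Transformations are lazy: these lines only record lineage,
    # no computation runs yet.
    squares = nums.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # Actions force evaluation of the recorded lineage.
    print(evens.collect())   # [4, 16]
    print(squares.count())   # 5

    # Inspect the lineage and partitioning discussed in this chapter.
    print(evens.toDebugString().decode())
    print(nums.getNumPartitions())   # 2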

Chapter 3: Spark SQL

  • Why Spark SQL
  • What and Why of DataFrames
  • Exercise: Creating a Table via sqlContext.sql and Verifying It in HDFS (see the sketch after this list)
  • Creating DataFrames with Scala Examples
  • Creating DataFrames with Python Examples
  • Working with JSON Files with Scala
  • Working with the Databricks CSV Library
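
A minimal sketch of the Spark SQL material, assuming the PySpark shell, where `spark` (the SparkSession) is predefined; the outline's sqlContext.sql is the older Spark 1.x entry point for the same idea. The rows and the JSON path are illustrative:

    # Build a small DataFrame from illustrative rows.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 29)],
        ["name", "age"])

    # Register it as a temporary view and query it with SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # JSON files can be read directly into a DataFrame; the path is a placeholder.
    people = spark.read.json("hdfs:///data/people.json")
    people.printSchema()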

Chapter 4: Working with Spark

  • Exercise: Starting PySpark and Working with a Line Count Example
  • Exercise: Writing a Python Script File and Submitting It via spark-submit (sketched after this list)
  • Differences Between Scala and Python
  • Exercise: Working with Eclipse and Writing Word Count in Java
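
The script-submission exercise might look like the following; line_count.py is a hypothetical file name and the input path is a placeholder:

    # line_count.py -- a hypothetical script for the spark-submit exercise.
    from pyspark.sql import SparkSession

    # Unlike the shell, a standalone script must create its own session.
    spark = SparkSession.builder.appName("LineCount").getOrCreate()
    sc = spark.sparkContext

    # Count the lines in a text file; the path is a placeholder.
    lines = sc.textFile("hdfs:///data/sample.txt")
    print("line count:", lines.count())

    spark.stop()

The script then runs in batch mode with spark-submit line_count.py.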

Chapter 5: DataFrames

  • RDDs vs. DataFrames
  • User-Defined Functions (UDFs) in Spark
  • Working with 30 DataFrame Operations Using Sample Data
  • Checkpointing and Persist Operations
  • Creating UDFs and Using Them in DataFrames and via sqlContext.sql (see the sketch after this list)
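
A minimal UDF sketch in PySpark, assuming the shell's predefined `spark` session; the function, column names, and data are illustrative:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Wrap a plain Python function as a UDF with an explicit return type.
    name_length = udf(lambda s: len(s), IntegerType())

    # Use the UDF through the DataFrame API ...
    df.select(df.name, name_length(df.name).alias("length")).show()

    # ... or register it so it can also be called from SQL.
    spark.udf.register("name_length", lambda s: len(s), IntegerType())
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, name_length(name) AS length FROM people").show()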

Chapter 6: Spark Streaming

  • Need for Streaming and Streaming Architecture
  • Lambda Architecture
  • Spark Streaming Using PySpark (sketched after this list)
  • Spark Streaming Using the Scala IDE and Executing via spark-submit
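
A minimal sketch of the streaming material using the classic DStream API (newer Spark releases favor Structured Streaming); the script name, host, and port are illustrative:

    # streaming_word_count.py -- a hypothetical script name.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

    # Read lines from a TCP socket; host and port are placeholders.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()   # print each micro-batch's counts

    ssc.start()
    ssc.awaitTermination()

Run it with spark-submit streaming_word_count.py while a tool such as nc -lk 9999 feeds text to the socket.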

Chapter 7: Spark MLlib (Machine Learning)

  • MLlib Overview
  • Exercise: Hello MLlib (sketched after this list)
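
A "hello MLlib" sketch using the DataFrame-based spark.ml API in the PySpark shell; the four training rows are illustrative only:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # A tiny, made-up training set of (label, feature-vector) rows.
    training = spark.createDataFrame([
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5])),
    ], ["label", "features"])

    # Fit a logistic regression model and apply it back to the data.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(training)
    model.transform(training).select("label", "prediction").show()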

Please Contact Your ROI Representative to Discuss Course Tailoring!