Course 433:
Analyzing Big Data Using Apache Hadoop MapReduce2

(4 days)


Course Description

The original Apache Hadoop MapReduce1 (MR1) framework demonstrated to the industry a practical approach to mining Big Data and the benefits that follow from it. As a distributed processing framework, however, Hadoop MR1 was limited in that it supported only map-reduce as a paradigm for analyzing data. With a hammer in hand, everything looks like a nail, and so it was with Hadoop MR1. Although many data mining problems can be formulated as map-reduce problems, it soon becomes clear that mining Big Data should not be synonymous with map-reduce. That these are separate concerns is best illustrated by the design of the Hadoop MapReduce2 (MR2) framework.

Assuming no prior knowledge of Hadoop, this course begins by presenting traditional batch-oriented map-reduce applications, illustrating the role of the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce2 framework.

The course goes on to present the new Hadoop MR2 architecture based on YARN (Yet Another Resource Negotiator) and illustrates how YARN is key to Hadoop MR2's support for other frameworks such as Apache Spark (Spark) and Apache Storm (Storm).

Both Spark and Storm are examined and compared with MapReduce. Finally, this course introduces Apache Pig (Pig) and Apache Hive (Hive) and how they work in the YARN infrastructure to provide a relational perspective on HDFS data.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the architecture of Hadoop MR2 and YARN
  • Explain the basic operation of HDFS
  • Develop MapReduce applications
  • Implement and execute Apache Spark applications running in YARN
  • Run Apache Storm applications within YARN
  • View HDFS data from a relational perspective using Pig and Hive


Examples are presented in Java, Python, and Scala. The exercises require the ability to make slight adaptations of code presented in the notes. The examples involving Pig and Hive will make use of SQL.

Who Should Attend

This course is intended for anyone wanting to understand how some of the major components of the Apache Hadoop MR2 ecosystem work, including HDFS, YARN, MapReduce, Spark, Storm, Pig, and Hive. This is a hands-on course. The exercises are intended to give participants first-hand experience developing applications that run both in pseudo-distributed environments on their classroom machines and in fully distributed environments in the cloud.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Course Outline

1.  Overview of Apache Hadoop MR2

  • Introducing Map, Reduce, Combine, and Shuffle
  • MapReduce from a Linux Console
  • Scaling MapReduce on a Cluster
  • Introducing Apache Hadoop
  • Configuring, Starting and Stopping Hadoop
  • MapReduce with Hadoop
  • Introducing HDFS and YARN
  • Key-Value Pairs and Directed Acyclic Graphs
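The Map, Reduce, and Shuffle phases introduced in this module can be sketched in a few lines of plain Python. This is an illustrative single-process model of the classic word-count example, not Hadoop code:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) key-value pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's list of values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# Because each phase depends only on its input, the map and reduce
# work can be spread across many machines.
```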

2.   The Hadoop Distributed File System (HDFS)

  • Overview of HDFS
  • Launching HDFS in Pseudo-Distributed Mode
  • Core HDFS Services
  • Installing and Configuring HDFS
  • HDFS Commands
  • HDFS Safe Mode
  • Checkpointing HDFS
  • Federated and High Availability HDFS
  • Running a Fully Distributed HDFS Cluster with Docker


3.   Analyzing Data Using Hadoop MR2 Streaming

  • Overview of YARN
  • Launching YARN in Pseudo-Distributed Mode
  • Core YARN Services
  • Installing and Configuring YARN
  • Running a Fully Distributed Hadoop Cluster with Docker
  • Planning MapReduce2 Jobs
  • Processing Log Files with UNIX Tools
  • Running Tasks in Pseudo-Distributed Mode
  • Changing the Number of Map and Reduce Tasks
  • Supplying Scripts to Hadoop
  • The Aggregate Package
  • Distributive, Algebraic, and Holistic Statistics
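A Hadoop Streaming job supplies the mapper and reducer as ordinary executables that read standard input and write standard output. A minimal Python word-count pair might look like the sketch below, shown here as functions over line iterators so the logic is testable in one file:

```python
def mapper(lines):
    # Map step: emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so all records
    # for the same word arrive adjacent to one another.
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

In an actual job, each function would live in its own script reading `sys.stdin` and printing to `sys.stdout`, supplied to the framework with the streaming jar's `-mapper` and `-reducer` options; the sort between the two phases is performed by Hadoop itself.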

4.   Writing a MR2 Application in Java

  • Why Use Java?
  • Implementing a Mapper and Reducer
  • Using Context
  • Processing Parameters with Configuration
  • Including a Custom Combiner
  • Hadoop Serialization and Custom Writable
  • Managing Tabular Data
  • Improving MapReduce Performance with Partitioning
  • Counters
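Partitioning, listed above, decides which reduce task receives each key. The routing idea behind Hadoop's default hash partitioner can be mimicked in a few lines of Python (a conceptual sketch, not the Java `Partitioner` API, and Python's `hash` differs from Java's `hashCode`):

```python
def partition(key, num_reducers):
    # Mimic the default hash partitioner: hash the key, clear the sign
    # bit, then take the remainder by the number of reduce tasks.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = {r: [] for r in range(3)}
for key, value in pairs:
    buckets[partition(key, 3)].append((key, value))

# Every occurrence of a given key lands in the same bucket, so a single
# reducer sees all of that key's values.
```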

5.   Apache Spark

  • Why Spark?
  • Spark Standalone vs. Spark on YARN
  • Spark’s Resilient Distributed Datasets (RDDs)
  • Spark Drivers and Executors with YARN
  • Analyzing Data with Spark Streaming
  • Using Spark’s Query Language Spark SQL
  • Graph Processing with Spark GraphX
  • Machine Learning with MLlib
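Spark expresses jobs as chains of RDD transformations such as `flatMap`, `map`, and `reduceByKey`. The plain-Python sketch below shows what each transformation in a word-count pipeline computes (illustrative only; in PySpark the same pipeline is roughly `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`):

```python
from itertools import groupby

lines = ["spark runs on yarn", "storm runs on yarn"]

# flatMap: each input line fans out into many words.
words = [w for line in lines for w in line.split()]
# map: pair each word with an initial count of 1.
pairs = [(w, 1) for w in words]
# reduceByKey: bring equal keys together, then sum within each group.
counts = {k: sum(v for _, v in grp)
          for k, grp in groupby(sorted(pairs), key=lambda p: p[0])}
# counts["yarn"] is 2
```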

6.   Apache Storm

  • Processing Real-Time Streaming Data
  • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
  • Application Design: Topologies, Spouts, and Bolts
  • Real-Time Processing of Word Count
  • Transactional Topologies
  • Storm on YARN
  • Graph Processing with Storm
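Storm applications wire spouts (stream sources) into bolts (processing steps) to form a topology. The real-time word-count design covered in this module can be sketched with Python generators standing in for Storm's components (a conceptual model only; real topologies use Storm's Java or multi-language API):

```python
from collections import Counter

def sentence_spout():
    # Spout: normally an unbounded source of tuples; a finite stand-in here.
    yield from ["to be or not to be"]

def split_bolt(stream):
    # Bolt: splits each incoming sentence tuple into word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: maintains a running count per word, like a fields-grouped counter.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
```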

7.   Apache Pig

  • Declarative vs. Procedural
  • Role of Pig
  • Setting Up Pig
  • Loading and Working with Data
  • Writing a Pig Script
  • Executing Pig in Local and Hadoop Mode
  • Filtering Results
  • Storing, Loading, Dumping
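The load-filter-store flow of a Pig script can be previewed in plain Python. This is a conceptual sketch of the dataflow; in the course itself these steps are written in Pig Latin (LOAD, FILTER, DUMP, STORE):

```python
# LOAD: read tab-separated records into (name, age) tuples.
raw = ["alice\t34", "bob\t19", "carol\t45"]
people = [(name, int(age)) for name, age in (line.split("\t") for line in raw)]

# FILTER: keep only the records matching a predicate.
adults = [(name, age) for name, age in people if age >= 21]

# DUMP / STORE: emit the filtered relation.
for name, age in adults:
    print(f"{name}\t{age}")
```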

8.   Getting the Most Out of Pig

  • Relations, Tuples, Fields
  • Pig Data Types
  • Tuples, Bags, and Maps
  • Flatten on Bags and Tuples
  • Join and Union
  • Regular Expressions
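The join listed above combines two relations on a shared key. Its effect can be sketched in plain Python (conceptual only; in Pig this is a one-line JOIN statement over two loaded relations):

```python
# Two relations keyed by user id, as Pig might load them.
names = [(1, "alice"), (2, "bob")]
ages = [(1, 34), (2, 19), (3, 45)]

# JOIN names BY id, ages BY id: pair every matching tuple from each side.
joined = [(id_n, name, age)
          for id_n, name in names
          for id_a, age in ages
          if id_n == id_a]
# id 3 has no match in names, so it drops out, as in an inner join.
```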

9.   Apache Hive

  • What Is Hive?
  • Pig vs. Hive
  • Ad-hoc Queries
  • Creating Hive Tables
  • Loading Data from Flat Files into Hive
  • SQL Queries
  • Partitions and Buckets
  • Built-in and Aggregation Functions
  • MapReduce Scripts

Please Contact Your ROI Representative to Discuss Course Tailoring!