Course 434:
Hadoop for MapReduce Applications

(2 days)


Course Description

Students leave this course with the skills they need to develop reliable, scalable, distributed applications using Apache Hadoop. The course is intended for those who want to understand what MapReduce is, how Apache Hadoop implements it, and how to design and implement MapReduce applications using the scripting and Java APIs.

Learning Objectives

After successfully completing this course, students will be able to:

  • Understand the concepts of MapReduce and Big Data
  • Leverage Hadoop as a reliable, scalable MapReduce framework
  • Use the Hadoop Distributed File System (HDFS) for storing big data files
  • Employ Hadoop Streaming to run non-Java programs
  • Implement MapReduce applications in Java

Who Should Attend

Anyone wishing to understand, develop, or maintain Hadoop MapReduce applications.

Experience in Java and data administration is helpful.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Course Outline

1.  Overview of Big Data

  • What Is Big Data?
  • Big Data Use Cases
  • The Rise of the Data Center and Cloud Computing
  • MapReduce and Batch Data Processing
  • MapReduce and Near Real-Time (Stream) Processing
  • NoSQL Solutions for Persisting Big Data
  • The Big Data Ecosystem

2.  The Hadoop Distributed File System (HDFS)

  • Overview of HDFS
  • Launching HDFS in Pseudo-Distributed Mode
  • Core HDFS Services
  • Installing and Configuring HDFS
  • HDFS Commands
  • HDFS Safe Mode
  • Checkpointing HDFS
  • Federated and High Availability HDFS
  • Running a Fully Distributed HDFS Cluster with Docker

3.  MapReduce with Hadoop

  • MapReduce from the Linux Command Line
  • Scaling MapReduce on a Cluster
  • Introducing Apache Hadoop
  • Overview of YARN
  • Launching YARN in Pseudo-Distributed Mode
  • Demonstration of the Hadoop Streaming API
  • Demonstration of MapReduce with Java
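
To give a flavor of the streaming-style topics in this module, here is a minimal sketch of a word count written as a separate map step and reduce step in Python. It is an illustration only, not the course's lab code: the function names `mapper`, `reducer`, and `word_count` are hypothetical, and the in-memory `sorted()` call merely stands in for the shuffle-and-sort step that a framework such as Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts in each run of identical keys.

    Assumes the records arrive sorted by key, as they would after a
    shuffle-and-sort step.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    # sorted() is a local stand-in for the framework's shuffle-and-sort.
    return dict(reducer(sorted(mapper(lines))))
```

Calling `word_count` on a list of text lines returns a dictionary that maps each lowercased word to its number of occurrences.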

4.  Introduction to Apache Spark

  • Why Spark?
  • Spark Architecture
  • Spark Drivers and Executors
  • Spark on YARN
  • Spark and the Hive Metastore
  • Structured APIs, DataFrames, and Datasets
  • The Core API and Resilient Distributed Datasets (RDDs)
  • Overview of Functional Programming
  • MapReduce with Python
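
As a small taste of the functional programming ideas listed above, the following sketch expresses a computation as a map step followed by a reduce step using only the Python standard library. The function name `sum_of_squares` is a hypothetical example chosen for illustration; it is not taken from the course material and does not use Spark.

```python
from functools import reduce
from operator import add


def sum_of_squares(numbers):
    """Square every element (map), then fold the results into one total (reduce)."""
    squared = map(lambda n: n * n, numbers)  # map: transform each element independently
    return reduce(add, squared, 0)           # reduce: combine the results, starting from 0
```

Because the map step treats every element independently, it is the kind of work that a distributed engine can spread across many machines before combining the partial results.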

Please Contact Your ROI Representative to Discuss Course Tailoring!