Course 434:
Hadoop for MapReduce Applications

(2 days)

 

Course Description

Students who attend this course will leave with the skills they need to develop reliable, scalable, distributed applications using Apache Hadoop. The course is useful for programmers who need to create MapReduce applications and for data architects who can program in scripting languages such as Python or Perl, or who perform analysis using statistical packages such as R. It is also beneficial for system programmers and anyone familiar with writing scripts. Finally, it is useful for system and data administrators tasked with maintaining Hadoop applications, since it gives them a deep understanding of how Hadoop applications communicate and share data.

Learning Objectives

After successfully completing this course, students will be able to:

  • Develop reliable, scalable, distributed applications using Hadoop
  • Set up and configure Hadoop
  • Implement a MapReduce application
  • Manage and monitor memory, parameters, and jobs
  • Employ Hadoop streaming to run non-Java programs
  • Distribute application data and storage using HDFS
  • Manage data replication, organization, and reclamation

Who Should Attend

Anyone wishing to develop, maintain, or deploy scalable, distributed applications.

Prerequisites

Experience in Java and data administration is helpful.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.


Course Outline

Overview of Apache Hadoop MR2

  • Introducing Map, Reduce, Combine, and Shuffle
  • MapReduce from a Linux Console
  • Scaling MapReduce on a Cluster
  • Introducing Apache Hadoop
  • Configuring, Starting, and Stopping Hadoop
  • MapReduce with Hadoop
  • Introducing HDFS and YARN
  • Key-Value Pairs and Directed Acyclic Graphs

The Hadoop Distributed File System (HDFS)

  • Overview of HDFS
  • Launching HDFS in Pseudo-Distributed Mode
  • Core HDFS Services
  • Installing and Configuring HDFS
  • HDFS Commands
  • HDFS Safe Mode
  • Checkpointing HDFS
  • Federated and High Availability HDFS
  • Running a Fully Distributed HDFS Cluster with Docker

Analyzing Data Using Hadoop MR2 Streaming

  • Overview of YARN
  • Launching YARN in Pseudo-Distributed Mode
  • Core YARN Services
  • Installing and Configuring YARN
  • Running a Fully Distributed Hadoop Cluster with Docker
  • Planning MapReduce2 Jobs
  • Processing Log Files with UNIX Tools
  • Running Tasks in Pseudo-Distributed Mode
  • Changing the Number of Map and Reduce Tasks
  • Supplying Scripts to Hadoop
  • The Aggregate Package
  • Distributive, Algebraic, and Holistic Statistics

Writing an MR2 Application in Java

  • Why Use Java?
  • Implementing a Mapper and Reducer
  • Using Context
  • Processing Parameters with Configuration
  • Including a Custom Combiner
  • Hadoop Serialization and Custom Writable
  • Managing Tabular Data
  • Improving MapReduce Performance with Partitioning
  • Counters

Please Contact Your ROI Representative to Discuss Course Tailoring!