Course 433:
Analyzing Big Data Using Hadoop, Hive, Spark, and HBase

(4 days)


Course Description

This course starts with an overview of Big Data and its role in the enterprise. It then presents the Hadoop Distributed File System (HDFS), the foundation for much of the other Big Data technology covered in the course. Hadoop MapReduce is then introduced, and simple MapReduce applications are demonstrated using both the streaming and Java APIs.
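The streaming API's contract (a mapper emits key/value pairs, and a reducer receives each key's values grouped after a sort) can be previewed in plain Python with no cluster at all. The word-count below is a minimal sketch of that contract, with the shuffle phase simulated by a sort:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word, as a streaming mapper would write to stdout
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the counts for one key, as the reducer sees each key's grouped values
    return (word, sum(counts))

def word_count(lines):
    # Simulate the shuffle-and-sort phase that Hadoop runs between map and reduce
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=lambda kv: kv[0])]

print(word_count(["the quick brown fox", "the lazy dog"]))
```

On a real cluster, Hadoop runs the mapper and reducer as separate processes and distributes the sort across nodes; the logic of each phase is unchanged.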

With that groundwork laid, the course introduces Apache Spark on YARN as a high-performance, flexible platform for cluster computing. Spark's architecture and APIs are presented with an emphasis on mining HDFS data with MapReduce.
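Spark's core API expresses jobs as chains of functional operations over distributed datasets. The plain-Python sketch below mimics the shape of such a chain (the map/filter/reduce pipeline a PySpark RDD job would express as `rdd.map(...).filter(...).reduce(...)`) without requiring a Spark installation:

```python
from functools import reduce

# Plain-Python stand-in for a Spark RDD pipeline: the same functional chain,
# but evaluated locally instead of across cluster executors.
numbers = range(1, 11)
squares = map(lambda n: n * n, numbers)        # analogous to rdd.map
evens = filter(lambda n: n % 2 == 0, squares)  # analogous to rdd.filter
total = reduce(lambda a, b: a + b, evens)      # analogous to rdd.reduce
print(total)  # sum of the even squares of 1..10
```

In Spark, the same chain is built lazily and only executed, in parallel across the cluster, when an action such as `reduce` is called.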

The focus of the course then shifts to using Hadoop as a data warehouse platform. The first technology examined from this perspective is Apache Hive, which allows clients to access HDFS files as if they were relational tables, using an SQL-like query language called the Hive Query Language (HQL). The course gives an overview of HQL and shows how table metadata can be accessed by other applications such as Spark.
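To give a flavor of HQL, the sketch below overlays a table definition on a delimited HDFS file and then queries it with familiar SQL syntax; the table, column, and path names are illustrative, not taken from the course materials:

```sql
-- Expose a tab-delimited HDFS file as a Hive table (illustrative names)
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Familiar SQL-style aggregation, compiled by Hive into a distributed job
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```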

This is followed by a discussion of the HBase column-family database. The HBase architecture and data model, and their relationship to HDFS, are described. HBase's APIs for creating, reading, updating, and deleting tables are presented.
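The create/read/update/delete cycle maps directly onto HBase shell commands. A minimal sketch, with an illustrative table name and column family:

```
create 'users', 'info'                  # table with one column family
put 'users', 'u1', 'info:name', 'Ada'   # write (or overwrite) a cell
get 'users', 'u1'                       # read one row by key
scan 'users'                            # read the whole table
delete 'users', 'u1', 'info:name'       # delete a cell
```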

The course concludes with a presentation of Apache Storm, which applies MapReduce-style processing to near-real-time streaming data.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the architecture of Hadoop
  • Manage files and directories on HDFS
  • Explain the components of a MapReduce application on Hadoop
  • Implement and execute Apache Spark applications
  • Use the Hive Query Language (HQL) to analyze HDFS data
  • Create mutable tables on HDFS with HBase
  • Process near real-time streaming data with Apache Storm


Examples are presented in Java, Python, and SQL. The exercises require the ability to adapt code presented in the course notes.

Who Should Attend

This course is intended for anyone wanting to understand how the major components of the Apache Hadoop MapReduce ecosystem work, including HDFS, YARN, MapReduce, Hive, HBase, Spark, and Storm. This is a hands-on course; the exercises are intended to give participants first-hand experience developing Big Data applications.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Course Outline


1. Overview of Big Data

  • What Is Big Data?
  • Big Data Use Cases
  • The Rise of the Data Center and Cloud Computing
  • MapReduce and Batch Data Processing
  • MapReduce and Near Real-Time (Stream) Processing
  • NoSQL Solutions for Persisting Big Data
  • The Big Data Ecosystem

2. The Hadoop Distributed File System (HDFS)

  • Overview of HDFS
  • Launching HDFS in Pseudo-Distributed Mode
  • Core HDFS Services
  • Installing and Configuring HDFS
  • HDFS Commands
  • HDFS Safe Mode
  • Checkpointing HDFS
  • Federated and High Availability HDFS
  • Running a Fully-Distributed HDFS Cluster with Docker

3. MapReduce with Hadoop

  • MapReduce from the Linux Command Line
  • Scaling MapReduce on a Cluster
  • Introducing Apache Hadoop
  • Overview of YARN
  • Launching YARN in Pseudo-Distributed Mode
  • Demonstration of the Hadoop Streaming API
  • Demonstration of MapReduce with Java

4. Introduction to Apache Spark

  • Why Spark?
  • Spark Architecture
  • Spark Drivers and Executors
  • Spark on YARN
  • Spark and the Hive Metastore
  • Structured APIs, DataFrames, and Datasets
  • The Core API and Resilient Distributed Datasets (RDDs)
  • Overview of Functional Programming
  • MapReduce with Python

5. Apache Hive

  • Hive as a Data Warehouse
  • Hive Architecture
  • Understanding the Hive Metastore and HCatalog
  • Interacting with Hive using the Beeline Interface
  • Creating Hive Tables
  • Loading Text Data Files into Hive
  • Exploring the Hive Query Language
  • Partitions and Buckets
  • Built-in and Aggregation Functions
  • Invoking MapReduce Scripts from Hive
  • Common File Formats for Big Data Processing
  • Creating Avro and Parquet Files with Hive
  • Creating Hive Tables from Pig
  • Accessing Hive Tables with the Spark SQL Shell

6. Persisting Data with Apache HBase

  • Features and Use Cases
  • HBase Architecture
  • The Data Model
  • Command Line Shell
  • Schema Creation
  • Considerations for Row Key Design

7. Apache Storm

  • Processing Real-Time Streaming Data
  • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
  • Application Design: Topologies, Spouts, and Bolts

Appendix A: Apache Pig

  • Declarative vs. Procedural
  • Role of Pig
  • Setting Up Pig
  • Loading and Working with Data
  • Writing a Pig Script
  • Executing Pig in Local and Hadoop Mode
  • Filtering Results
  • Storing, Loading, Dumping

Appendix B: Getting the Most Out of Pig

  • Relations, Tuples, Fields
  • Pig Data Types
  • Tuples, Bags, and Maps
  • Flatten on Bags and Tuples
  • Join and Union
  • Regular Expressions

Please Contact Your ROI Representative to Discuss Course Tailoring!