Course 433:
Analyzing Big Data Using Hadoop, Hive, Spark, and HBase

(4 days)


Course Description

This course starts with an overview of Big Data and its role in the enterprise. It then presents the Hadoop Distributed File System (HDFS), the foundation for much of the other Big Data technology covered in the course. Hadoop MapReduce is then introduced, and simple MapReduce applications are demonstrated using both the streaming and Java APIs.
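The streaming API's contract (a mapper emits key/value pairs, and a reducer receives each key's values grouped after a sort) can be previewed in plain Python with no cluster at all. The word-count below is a minimal sketch of that contract, with the shuffle phase simulated by a sort:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word, as a streaming mapper would write to stdout
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the counts for one key, as the reducer sees each key's grouped values
    return (word, sum(counts))

def word_count(lines):
    # Simulate the shuffle-and-sort phase that Hadoop runs between map and reduce
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=lambda kv: kv[0])]

print(word_count(["the quick brown fox", "the lazy dog"]))
```

On a real cluster, Hadoop runs the mapper and reducer as separate processes and distributes the sort across nodes; the logic of each phase is unchanged.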

With that groundwork laid, the course introduces Apache Spark on YARN as a high-performance, flexible platform for cluster computing. Spark's architecture and APIs are presented with an emphasis on mining HDFS data with MapReduce.
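Spark's core API expresses jobs as chains of functional operations over distributed datasets. The plain-Python sketch below mimics the shape of such a chain (the map/filter/reduce pipeline a PySpark RDD job would express as `rdd.map(...).filter(...).reduce(...)`) without requiring a Spark installation:

```python
from functools import reduce

# Plain-Python stand-in for a Spark RDD pipeline: the same functional chain,
# but evaluated locally instead of across cluster executors.
numbers = range(1, 11)
squares = map(lambda n: n * n, numbers)        # analogous to rdd.map
evens = filter(lambda n: n % 2 == 0, squares)  # analogous to rdd.filter
total = reduce(lambda a, b: a + b, evens)      # analogous to rdd.reduce
print(total)  # sum of the even squares of 1..10
```

In Spark, the same chain is built lazily and only executed, in parallel across the cluster, when an action such as `reduce` is called.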

The focus of the course then shifts to using Hadoop as a data warehouse platform. The first technology examined from this perspective is Apache Hive, which allows clients to access HDFS files as if they were relational tables, using an SQL-like query language called the Hive Query Language (HQL). The course gives an overview of HQL and shows how table metadata can be accessed by other applications such as Spark.
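To give a flavor of HQL, the sketch below overlays a table definition on a delimited HDFS file and then queries it with familiar SQL syntax; the table, column, and path names are illustrative, not taken from the course materials:

```sql
-- Expose a tab-delimited HDFS file as a Hive table (illustrative names)
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Familiar SQL-style aggregation, compiled by Hive into a distributed job
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```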

This is followed by a discussion of the HBase column-family database. The HBase architecture and data model, and their relationship to HDFS, are described. HBase's APIs for creating, reading, updating, and deleting tables are presented.
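The create/read/update/delete cycle maps directly onto HBase shell commands. A minimal sketch, with an illustrative table name and column family:

```
create 'users', 'info'                  # table with one column family
put 'users', 'u1', 'info:name', 'Ada'   # write (or overwrite) a cell
get 'users', 'u1'                       # read one row by key
scan 'users'                            # read the whole table
delete 'users', 'u1', 'info:name'       # delete a cell
```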

The course concludes with a presentation of Apache Storm, which applies MapReduce-style processing to near-real-time streaming data.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the architecture of Hadoop
  • Manage files and directories on HDFS
  • Explain the components of a MapReduce application on Hadoop
  • Implement and execute Apache Spark applications
  • Use the Hive Query Language (HQL) to analyze HDFS data
  • Create mutable tables on HDFS with HBase
  • Process near real-time streaming data with Apache Storm


Examples are presented in Java, Python, and SQL. The exercises require the ability to adapt code presented in the course notes.

Who Should Attend

This course is intended for anyone wanting to understand how the major components of the Apache Hadoop MapReduce ecosystem work, including HDFS, YARN, MapReduce, Hive, HBase, Spark, and Storm. This is a hands-on course; the exercises are intended to give participants first-hand experience developing Big Data applications.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Course Outline


1. Overview of Big Data

  • What Is Big Data?
  • Big Data Use Cases
  • The Rise of the Data Center and Cloud Computing
  • MapReduce and Batch Data Processing
  • MapReduce and Near Real-Time (Stream) Processing
  • NoSQL Solutions for Persisting Big Data
  • The Big Data Ecosystem

2. The Hadoop Distributed File System (HDFS)

  • Overview of HDFS
  • Launching HDFS in Pseudo-Distributed Mode
  • Core HDFS Services
  • Installing and Configuring HDFS
  • HDFS Commands
  • HDFS Safe Mode
  • Checkpointing HDFS
  • Federated and High Availability HDFS
  • Running a Fully-Distributed HDFS Cluster with Docker

3. MapReduce with Hadoop

  • MapReduce from the Linux Command Line
  • Scaling MapReduce on a Cluster
  • Introducing Apache Hadoop
  • Overview of YARN
  • Launching YARN in Pseudo-Distributed Mode
  • Demonstration of the Hadoop Streaming API
  • Demonstration of MapReduce with Java

4. Introduction to Apache Spark

  • Why Spark?
  • Spark Architecture
  • Spark Drivers and Executors
  • Spark on YARN
  • Spark and the Hive Metastore
  • Structured APIs, DataFrames, and Datasets
  • The Core API and Resilient Distributed Datasets (RDDs)
  • Overview of Functional Programming
  • MapReduce with Python

5. Apache Hive

  • Hive as a Data Warehouse
  • Hive Architecture
  • Understanding the Hive Metastore and HCatalog
  • Interacting with Hive using the Beeline Interface
  • Creating Hive Tables
  • Loading Text Data Files into Hive
  • Exploring the Hive Query Language
  • Partitions and Buckets
  • Built-in and Aggregation Functions
  • Invoking MapReduce Scripts from Hive
  • Common File Formats for Big Data Processing
  • Creating Avro and Parquet Files with Hive
  • Creating Hive Tables from Pig
  • Accessing Hive Tables with the Spark SQL Shell

6. Persisting Data with Apache HBase

  • Features and Use Cases
  • HBase Architecture
  • The Data Model
  • Command Line Shell
  • Schema Creation
  • Considerations for Row Key Design

7. Apache Storm

  • Processing Real-Time Streaming Data
  • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
  • Application Design: Topologies, Spouts, and Bolts

Appendix A: Apache Pig

  • Declarative vs. Procedural
  • Role of Pig
  • Setting Up Pig
  • Loading and Working with Data
  • Writing a Pig Script
  • Executing Pig in Local and Hadoop Mode
  • Filtering Results
  • Storing, Loading, Dumping

Appendix B: Getting the Most Out of Pig

  • Relations, Tuples, Fields
  • Pig Data Types
  • Tuples, Bags, and Maps
  • Flatten on Bags and Tuples
  • Join and Union
  • Regular Expressions

Please Contact Your ROI Representative to Discuss Course Tailoring!