Course 435:
Creating Distributed Applications
Using Pig and Hive on Hadoop

(2 days)


Course Description

Students who attend this course will learn how to create MapReduce pipelines intuitively using Apache Pig, mine data sets using the Apache Hive query language (HQL), an SQL-like language for data warehousing, and perform near real-time stream processing with Apache Storm.

Learning Objectives

After successfully completing this course, students will be able to:

  • Create data processing pipelines with Pig
  • Answer data mining questions using Pig Latin
  • Create and query a Big Data warehouse with Hive
  • Explain how Apache Storm supports near real-time processing

Who Should Attend

This course is intended for anyone wanting to understand how some of the major components of the Apache Hadoop MapReduce ecosystem work, including Apache Pig, Hive, and Storm.


Completion of course 434, “Hadoop for MapReduce Applications”, or equivalent experience using Hadoop is assumed.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.

Course Outline

1. Apache Pig

  • Declarative vs. Procedural
  • Role of Pig
  • Setting Up Pig
  • Loading and Working with Data
  • Writing a Pig Script
  • Executing Pig in Local and Hadoop Mode
  • Filtering Results
  • Storing, Loading, Dumping

2. Getting the Most Out of Pig

  • Relations, Tuples, Fields
  • Pig Data Types
  • Tuples, Bags, and Maps
  • Flatten on Bags and Tuples
  • Join and Union
  • Regular Expressions

3. Apache Hive

  • Hive as a Data Warehouse
  • Hive Architecture
  • Understanding the Hive Metastore and HCatalog
  • Interacting with Hive Using the Beeline Interface
  • Creating Hive Tables
  • Loading Text Data Files into Hive
  • Exploring the Hive Query Language
  • Partitions and Buckets
  • Built-in and Aggregation Functions
  • Invoking MapReduce Scripts from Hive
  • Common File Formats for Big Data Processing
  • Creating Avro and Parquet Files with Hive
  • Creating Hive Tables from Pig
  • Accessing Hive Tables with the Spark SQL Shell

4. Apache Storm

  • Processing Real-Time Streaming Data
  • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
  • Application Design: Topologies, Spouts, and Bolts

Please Contact Your ROI Representative to Discuss Course Tailoring!