Course 435:
Creating Distributed Applications
Using Pig and Hive on Hadoop

(4 days)


Course Description

Students who attend this course will leave armed with the skills they need to develop reliable, scalable, distributed applications using Apache Hadoop. The course is useful for Java programmers who need to create MapReduce applications; for data architects who program in scripting languages like Python or Perl and perform analysis using statistical packages such as R; for data analysts looking for ways to handle Big Data effectively; and for database developers wishing to build data pipelines with Pig or do data warehousing with Hive. It is also useful for system and data administrators tasked with maintaining Hadoop, Pig, and/or Hive applications, since it provides them with a deep understanding of how Hadoop applications communicate and share data.

Learning Objectives

After successfully completing this course, students will be able to:

  • Develop reliable, scalable, distributed applications using Hadoop, Pig, and Hive
  • Implement a MapReduce application in Python, Perl, or Java
  • Manage and monitor memory, parameters, and jobs
  • Distribute application data and storage using HDFS
  • Manage data replication, organization, and reclamation
  • Create a Big Data analysis pipeline using Pig
  • Provide Big Data warehousing capabilities using Hive
  • Write Pig Latin scripts and define custom Pig functions
  • Write Hive queries and partition data for efficient querying
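To give a flavor of the MapReduce objective above, here is a minimal word-count mapper and reducer in the Hadoop Streaming style, sketched in Python (the file and data are illustrative, not taken from the course labs):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit a tab-separated (word, 1) pair for every word, matching
    Hadoop Streaming's line-oriented key/value convention."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"

def reducer(pairs):
    """Sum the counts for each word. Hadoop sorts mapper output by key
    before the reduce phase, so equal words arrive in contiguous runs."""
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Simulate the Hadoop Streaming pipeline locally: mapper | sort | reducer.
    # On a cluster, Hadoop performs the sort-and-shuffle step between phases.
    text = ["the quick brown fox", "the lazy dog"]
    for result in reducer(sorted(mapper(text))):
        print(result)
```

On a real cluster the same two functions would read stdin and write stdout as two separate scripts passed to the `hadoop jar ... -mapper ... -reducer ...` streaming command.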

Who Should Attend

Anyone wishing to develop, maintain, or deploy scalable, distributed applications that analyze Big Data.

Prerequisites

Experience in Java and data administration is helpful.

Hands-On Labs

The hands-on labs are a key learning element of this course. Each lab reinforces the material presented in lecture.


Course Outline

Apache Spark

  • Why Spark?
  • Spark Standalone vs. Spark on YARN
  • Spark’s Resilient Distributed Datasets (RDDs)
  • Spark Drivers and Executors with YARN
  • Analyzing Data with Spark Streaming
  • Using Spark SQL, Spark’s Query Language
  • Graph Processing with Spark GraphX
  • Machine Learning with MLlib

Apache Storm

  • Processing Real-Time Streaming Data
  • Storm Architecture: Nimbus, Supervisors, and ZooKeeper
  • Application Design: Topologies, Spouts, and Bolts
  • Real-Time Processing of Word Count
  • Transactional Topologies
  • Storm on YARN
  • Graph Processing with Storm

Apache Pig

  • Declarative vs. Procedural
  • Role of Pig
  • Setting Up Pig
  • Loading and Working with Data
  • Writing a Pig Script
  • Executing Pig in Local and Hadoop Mode
  • Filtering Results
  • Storing, Loading, Dumping
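The loading, filtering, and storing topics above come together in a short Pig Latin script along these lines (a sketch; the file name and schema are hypothetical):

```pig
-- Load tab-separated log lines into a relation with a declared schema
logs = LOAD 'access_log.tsv' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:int);

-- Filter, then group and aggregate per user
big    = FILTER logs BY bytes > 1024;
byuser = GROUP big BY user;
totals = FOREACH byuser GENERATE group AS user, SUM(big.bytes) AS total;

-- DUMP prints to the console (handy in local mode);
-- STORE writes the relation out (to HDFS in Hadoop mode)
DUMP totals;
STORE totals INTO 'totals_by_user' USING PigStorage('\t');
```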

Getting the Most Out of Pig

  • Relations, Tuples, Fields
  • Pig Data Types
  • Tuples, Bags, and Maps
  • Flatten on Bags and Tuples
  • Join and Union
  • Regular Expressions
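A compact sketch of how the relational topics above look in Pig Latin (file names and schemas are hypothetical):

```pig
-- Two relations sharing a key field
logs     = LOAD 'access_log.tsv' USING PigStorage('\t')
           AS (user:chararray, url:chararray);
profiles = LOAD 'profiles.tsv' USING PigStorage('\t')
           AS (user:chararray, country:chararray);

-- JOIN matches tuples on the key across relations
joined = JOIN logs BY user, profiles BY user;

-- GROUP yields one bag of tuples per key;
-- FLATTEN un-nests that bag back into top-level tuples
byuser = GROUP logs BY user;
urls   = FOREACH byuser GENERATE group AS user, FLATTEN(logs.url) AS url;
```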

Apache Hive

  • What Is Hive?
  • Pig vs. Hive
  • Ad-hoc Queries
  • Creating Hive Tables
  • Loading Data from Flat Files into Hive
  • SQL Queries
  • Partitions and Buckets
  • Built-in and Aggregation Functions
  • MapReduce Scripts
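The table-creation, loading, and partitioning topics above translate into HiveQL along these lines (a sketch; the table, columns, and file path are hypothetical):

```sql
-- Create a partitioned table over tab-delimited text
CREATE TABLE page_views (user_id STRING, url STRING, bytes INT)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a flat file from the local filesystem into one partition
LOAD DATA LOCAL INPATH '/tmp/views-2015-01-01.tsv'
INTO TABLE page_views PARTITION (view_date = '2015-01-01');

-- Ad-hoc aggregation query; the partition predicate lets Hive
-- scan only the matching partition's directory
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE view_date = '2015-01-01'
GROUP BY user_id;
```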

Please Contact Your ROI Representative to Discuss Course Tailoring!