Course 790:
Google Cloud Dataflow Programming and Pub/Sub

(2 days)

 

Course Description

This instructor-led class introduces participants to Google Cloud Dataflow. Through a combination of instructor-led presentations, demonstrations, and hands-on labs, students learn how to use Dataflow to extract, transform, and load data from multiple data sources into Google BigQuery for analysis.

Objectives

At the end of this course, participants will be able to:

  • Integrate Dataflow into a big data processing strategy
  • Program batch and streaming Dataflow pipelines using the SDK
  • Transform data with PCollections
  • Execute and monitor Dataflow jobs
  • Test and debug Dataflow pipelines

Who Should Attend

Data Analysts, Data Scientists, and Programmers.

Prerequisites

Before attending this course, you should have:

  • Completed CP100A: Google Cloud Platform Fundamentals
  • Familiarity with extract, transform, and load (ETL) activities
  • Some programming experience in Java or a similar language

Modules

1.  Google Cloud Dataflow Overview

  • Data Flows
    • What Is a Data Flow?
    • Batch vs. Streaming
  • Examples of Data Flows
    • Moving Data
    • Moving and Transforming Data
    • Multiple Inputs
    • Multiple Outputs
  • Google Cloud Dataflow
    • What Is Google Cloud Dataflow?
    • Features
    • Integration with GCP
    • Types of Input
    • Types of Output
    • SDK
    • Core Types
    • Activity: Designing Real-World Data Flows

2.  Dataflow Pipelines

  • Creating Dataflow Projects
    • Creating a Project
    • Initializing gcloud
    • Dependencies
    • Creating a Dataflow Project with Maven
    • Pipelines
    • Pipeline Options
    • Creating a PCollection
    • Input-Transform-Output
    • Running Pipelines Locally
    • Exercise 1: Creating a Simple Data Flow
  • Testing and Debugging Pipelines
    • Output
    • Logging
    • Unit Testing
    • Outputting Results
    • Exercise 2: Testing, Logging, and Outputting Pipelines
  • Running Pipelines in Google Cloud Dataflow
    • Monitoring Progress
    • Instances
    • Viewing Logs
    • Exercise 3: Running Data Flows in the Cloud

3. Programming the Dataflow SDK

  • PCollections
    • PCollection Characteristics
    • Generic Types
    • Creating PCollections from Memory
    • Read Transforms
    • TextIO Read Example
    • Write Transforms
    • TextIO Write Example
    • Exercise 4: Using TextIO to Read and Write Data
  • Basic Transforms
    • ParDo
    • Input and Output Types
    • Side Inputs
    • Side Outputs
    • .apply()
    • Chaining Transforms
    • Count
    • Formatting Results
    • Writing Output
    • Exercise 5: Chaining Transforms
  • GroupByKey Transform
    • GroupByKey Example
    • Windowing
    • Exercise 6: Grouping Output by Key
  • Composite Transforms
    • Combining Multiple Transforms
    • PTransform
    • Overriding apply()
    • Exercise 7: Write Composite Transforms
  • Multiple Inputs
    • Multiple Collections for the Same Type
    • PCollectionList
    • Flatten
    • Multiple Collections of Different Types
    • PCollectionTuple
    • Exercise 8: Handling Multiple Inputs

4. Integrating Dataflow with BigQuery

  • BigQuery
    • BigQuery Overview
    • BigQuery Projects
    • Datasets and Tables
    • Table Schemas
    • Table Data Sources
    • Table Source Format
  • Writing to BigQuery from Dataflow
    • Defining Table Schemas
    • Outputting Dataflow Jobs to BigQuery Tables
    • BigQueryIO Example
    • Exercise 9: Outputting to BigQuery
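Before a Dataflow job can output to BigQuery, the destination table's schema must be defined. As a rough standard-library-only sketch (the field names here are invented, not taken from the course labs), the JSON shape the BigQuery API expects for a schema looks like this:

```python
# Building a BigQuery table schema in the JSON shape the BigQuery API
# expects. Field names (word, count) are illustrative only.
import json

def make_schema(fields):
    """fields: list of (name, type, mode) tuples -> BigQuery schema dict."""
    return {"fields": [{"name": n, "type": t, "mode": m} for n, t, m in fields]}

schema = make_schema([
    ("word",  "STRING",  "REQUIRED"),
    ("count", "INTEGER", "NULLABLE"),
])

print(json.dumps(schema, indent=2))
```

In the Java SDK this corresponds to building a TableSchema of TableFieldSchema entries and passing it to the BigQueryIO write transform along with the destination table.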

5. Streaming Data Flows with Pub/Sub

  • Pub/Sub
    • What Is Pub/Sub?
    • Topics and Subscriptions
    • Push vs. Pull Subscriptions
  • Using Pub/Sub with HTTP
    • Integrating Pub/Sub with HTTP Clients
    • Scopes and Authentication Tokens
    • Creating Topics and Subscriptions over HTTP
    • Publishing and Receiving Messages
    • Acknowledging Message Receipt
  • Using Pub/Sub with App Engine
    • Setting up Pub/Sub Endpoints in App Engine Apps
    • Programming Pub/Sub with the Python SDK
    • Using Web Hooks to Process Pub/Sub Messages
  • Using Pub/Sub with Dataflow
    • Reading from Pub/Sub
    • PubsubIO Read Example
    • Writing to Pub/Sub
    • PubsubIO Write Example
    • Windowing
    • Running Dataflow Streaming Jobs
    • Exercise 10: Creating a Streaming Data Flow
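The HTTP topics above center on the Pub/Sub REST API, where message payloads must be base64-encoded. As a standard-library-only sketch of building (not sending) a publish request body — the endpoint in the comment is the v1 REST path, and a real call would also need an OAuth bearer token with the appropriate scope:

```python
# Shape of a publish request to the Pub/Sub REST API:
#   POST https://pubsub.googleapis.com/v1/projects/{project}/topics/{topic}:publish
# Message data must be base64-encoded; no network I/O happens here.
import base64
import json

def publish_body(messages, attributes=None):
    """Build the JSON body for a topics.publish call."""
    return json.dumps({
        "messages": [
            {"data": base64.b64encode(m.encode("utf-8")).decode("ascii"),
             "attributes": attributes or {}}
            for m in messages
        ]
    })

body = publish_body(["hello pubsub"])
# A subscriber reverses the encoding to recover the original payload:
decoded = base64.b64decode(json.loads(body)["messages"][0]["data"]).decode("utf-8")
```

A pull subscriber receives the same base64 `data` field and must acknowledge each message by its ackId, which is the receipt-acknowledgement step listed above.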

6. Summary

 

Please Contact Your ROI Representative to Discuss Course Tailoring!