Data Engineering on Google Cloud
Data Engineering and Analytics
On-Site, Virtual
4 days
This four-day instructor-led class provides participants with a hands-on introduction to designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.
Objectives
This course teaches participants the following skills:
- Design scalable data processing systems in Google Cloud.
- Differentiate data architectures and implement data lakehouse and pipeline concepts.
- Build and manage robust streaming and batch data pipelines.
- Utilize AI/ML tools to optimize performance and gain process and data insights.
Audience
This class is intended for experienced Data Engineers, Data Analysts, and Data Architects who are responsible for managing big data transformations, including:
- Extracting, loading, transforming, cleaning, and validating data
- Designing pipelines and architectures for data processing
- Creating and maintaining machine learning and statistical models
- Querying datasets, visualizing query results, and creating reports
Prerequisites
To get the most out of this course, participants should have:
- Completed the Google Cloud Fundamentals: Big Data and Machine Learning course, or have equivalent experience
- Basic proficiency with a common query language such as SQL
- Experience with data modeling and extract, transform, and load (ETL) activities
- Experience with developing applications using a common programming language such as Python
- Familiarity with Machine Learning and/or statistics
Course outline
- Explain the role of a data engineer.
- Understand the differences between a data source and a data sink.
- Explain the different types of data formats.
- Explain the storage solution options on Google Cloud.
- Learn about the metadata management options on Google Cloud.
- Understand how to share datasets with ease using Analytics Hub.
- Understand how to load data into BigQuery using the Google Cloud console or the gcloud CLI (a minimal loading sketch follows this list).
- Lab: Loading Data into BigQuery
- Quiz
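As a taste of what the BigQuery loading lab covers, here is a minimal sketch that loads a CSV file from Cloud Storage into BigQuery. It uses the google-cloud-bigquery Python client rather than the console or CLI mentioned above, and the bucket, project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# Create a BigQuery client (uses Application Default Credentials).
client = bigquery.Client()

# Hypothetical source file and destination table.
source_uri = "gs://example-bucket/sales.csv"
table_id = "example-project.example_dataset.sales"

# Configure a CSV load job that skips the header row and infers the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Start the load job and block until it finishes.
load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```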
- Explain the baseline Google Cloud data replication and migration architecture.
- Understand the options and use cases for the gcloud command-line tool.
- Explain the functionality and use cases for Storage Transfer Service.
- Explain the functionality and use cases for Transfer Appliance.
- Understand the features and deployment of Datastream.
- Explain the baseline extract and load architecture diagram.
- Understand the options of the bq command-line tool.
- Explain the functionality and use cases for BigQuery Data Transfer Service.
- Explain the functionality and use cases for BigLake as a non-extract-load pattern.
- Lab: BigLake: Qwik Start
- Quiz
- Explain the baseline extract, load, and transform architecture diagram.
- Understand a common ELT pipeline on Google Cloud.
- Learn about BigQuery's SQL scripting and scheduling capabilities (a minimal scripting sketch follows this list).
- Explain the functionality and use cases for Dataform.
- Lab: Create and Execute a SQL Workflow in Dataform
- Quiz
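To illustrate the kind of SQL scripting this module introduces, here is a minimal sketch that runs a multi-statement BigQuery script from the Python client; the dataset and table names are placeholders, and in practice the same logic could be scheduled as a BigQuery scheduled query or expressed as a Dataform workflow.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A multi-statement BigQuery script: declare a variable, then build a summary table.
# example_dataset.orders and example_dataset.daily_orders are hypothetical tables.
script = """
DECLARE run_date DATE DEFAULT CURRENT_DATE();

CREATE OR REPLACE TABLE example_dataset.daily_orders AS
SELECT
  customer_id,
  SUM(amount) AS total_amount
FROM example_dataset.orders
WHERE DATE(order_ts) = run_date
GROUP BY customer_id;
"""

# Execute the whole script as a single job and wait for it to complete.
client.query(script).result()
```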
- Explain the baseline extract, transform, and load architecture diagram.
- Learn about the GUI tools on Google Cloud used for ETL data pipelines.
- Explain batch data processing using Dataproc.
- Learn how to use Dataproc Serverless for Spark for ETL.
- Explain streaming data processing options.
- Explain the role Bigtable plays in data pipelines.
- Lab: Use Dataproc Serverless for Spark to Load BigQuery (optional for ILT)
- Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow
- Explain the automation patterns and options available for pipelines.
- Learn about Cloud Scheduler and Workflows.
- Learn about Cloud Composer.
- Learn about Cloud Run functions.
- Explain the functionality and use cases for each of these services (a minimal Cloud Run function sketch follows this list).
- Lab: Use Cloud Run Functions to Load BigQuery (optional)
- Quiz
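In the spirit of the optional lab above, here is a minimal sketch of an event-driven Cloud Run function (Python, using the Functions Framework) that loads a newly finalized Cloud Storage object into BigQuery; the destination dataset and table names are placeholders.

```python
import functions_framework
from google.cloud import bigquery


@functions_framework.cloud_event
def load_to_bigquery(cloud_event):
    """Triggered by a Cloud Storage object-finalized event."""
    data = cloud_event.data
    source_uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    # Load the new file into a hypothetical landing table and wait for completion.
    client.load_table_from_uri(
        source_uri, "example_dataset.landing_table", job_config=job_config
    ).result()
```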
- Compare and contrast data lake, data warehouse, and data lakehouse architectures
- Evaluate the benefits of the lakehouse approach
- Quiz
- Discuss data storage options, including Cloud Storage for files, open table formats like Apache Iceberg, BigQuery for analytic data, and AlloyDB for operational data.
- Understand the role of AlloyDB for operational data use cases (a federated query sketch follows this list).
- Quiz
- Lab: Federated Query with BigQuery
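As a preview of the federated query lab, here is a minimal sketch that uses BigQuery's EXTERNAL_QUERY function to push a query down to an operational database (for example, AlloyDB or Cloud SQL) through an external connection; the connection ID and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner SQL inside the operational database registered
# behind the (hypothetical) connection, then joins the result into BigQuery SQL.
sql = """
SELECT customer_id, status
FROM EXTERNAL_QUERY(
  'projects/example-project/locations/us/connections/example-conn',
  'SELECT customer_id, status FROM orders;'
)
WHERE status = 'OPEN'
"""

for row in client.query(sql).result():
    print(row.customer_id, row.status)
```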
- Explain why BigQuery is a scalable data warehousing solution on Google Cloud.
- Discuss the core concepts of BigQuery.
- Understand BigLake's role in creating a unified lakehouse architecture and its integration with BigQuery for external data.
- Learn how BigQuery natively interacts with Apache Iceberg tables via BigLake (a BigLake external table sketch follows this list).
- Quiz
- Lab: Querying External Data and Iceberg Tables
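To make the BigLake objective concrete, here is a minimal sketch that defines a BigLake external table over Parquet files in Cloud Storage and queries it with standard SQL; the connection, bucket, dataset, and table names are placeholders, and an Iceberg table would be defined similarly by pointing at its metadata location.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define a BigLake external table over Parquet files, using a hypothetical
# Cloud resource connection that holds the service account used for data access.
ddl = """
CREATE OR REPLACE EXTERNAL TABLE example_dataset.sales_lake
WITH CONNECTION `us.example-biglake-conn`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bucket/sales/*.parquet']
);
"""
client.query(ddl).result()

# Query the lakehouse table just like a native BigQuery table.
rows = client.query(
    "SELECT region, SUM(amount) AS total FROM example_dataset.sales_lake GROUP BY region"
).result()
for row in rows:
    print(row.region, row.total)
```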
- Implement robust data governance and security practices across the unified data platform, including sensitive data protection and metadata management.
- Explore advanced analytics and machine learning directly on lakehouse data
- Quiz
- Reinforce the core principles of Google Cloud's data platform
- Lab: Getting Started with BigQuery ML (a minimal BigQuery ML sketch follows this list)
- Lab: Vector Search with BigQuery
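As a glimpse of the BigQuery ML lab, here is a minimal sketch that trains a logistic regression model and scores new rows entirely in SQL, issued through the Python client; the dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a simple logistic regression model on a hypothetical training table.
client.query("""
CREATE OR REPLACE MODEL example_dataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM example_dataset.customers
""").result()

# Score new rows with ML.PREDICT; the prediction column is predicted_<label>.
rows = client.query("""
SELECT customer_id, predicted_churned
FROM ML.PREDICT(
  MODEL example_dataset.churn_model,
  (SELECT customer_id, tenure_months, monthly_charges FROM example_dataset.new_customers)
)
""").result()
for row in rows:
    print(row.customer_id, row.predicted_churned)
```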
- Explain the critical role of a data engineer in developing and maintaining batch data pipelines.
- Describe the core components and typical lifecycle of batch data pipelines from ingestion to downstream consumption.
- Analyze common challenges in batch data processing, such as data volume, quality, complexity, and reliability, and identify key Google Cloud services that can address them.
- Quiz
- Design scalable batch data pipelines for high-volume data ingestion and transformation (a minimal batch pipeline sketch follows this module's outline).
- Optimize batch jobs for high throughput and cost-efficiency using various resource management and performance tuning techniques.
- Quiz
- Lab: Build a Simple Batch Data Pipeline with Serverless for Apache Spark (optional)
- Lab: Build a Simple Batch Data Pipeline with Dataflow Job Builder UI (optional)
- Develop data validation rules and cleansing logic to ensure data quality within batch pipelines.
- Implement strategies for managing schema evolution and performing data deduplication in large datasets.
- Lab: Validate Data Quality in a Batch Pipeline with Serverless for Apache Spark (optional)
- Quiz
- Orchestrate complex batch data pipeline workflows for efficient scheduling and lineage tracking.
- Implement robust error handling, monitoring, and observability for batch data pipelines.
- Lab: Building Batch Pipelines in Cloud Data Fusion
- Quiz
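To ground the batch pipeline modules above, here is a minimal Apache Beam sketch (runnable locally with the DirectRunner, or on Dataflow by switching the runner and adding project and region options) that reads CSV files from Cloud Storage, parses them, and appends the rows to BigQuery; all bucket, project, and table names are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_order(line):
    """Turn a CSV line into a dictionary matching the BigQuery schema."""
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}


def run():
    # DirectRunner executes locally; set runner="DataflowRunner" to run as a
    # Dataflow batch job instead.
    options = PipelineOptions(runner="DirectRunner", temp_location="gs://example-bucket/tmp")

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCsv" >> beam.io.ReadFromText("gs://example-bucket/orders/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_order)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:example_dataset.orders",
                schema="order_id:STRING,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```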
- Introduce the course learning objectives and the scenario that will be used to bring hands-on learning to building streaming data pipelines.
- Describe the concept of streaming data pipelines, the challenges associated with them, and the role of these pipelines within the data engineering process.
- Understand various streaming use cases and their applications, including Streaming ETL, Streaming AI/ML, Streaming Application, and Reverse ETL
- Identify and describe common sample architectures for streaming data, including Streaming ETL, Streaming AI/ML, Streaming Application, and Reverse ETL.
- Quiz
- Pub/Sub and Managed Service for Apache Kafka
- Define messaging concepts
- Know when to use Pub/Sub or Managed Service for Apache Kafka
- Dataflow
- Describe the service and challenges with streaming data
- Build and deploy a streaming pipeline
- BigQuery
- Explore various data ingestion methods
- Use BigQuery continuous queries, BigQuery ETL, and reverse ETL
- Configure Pub/Sub to BigQuery streaming (a minimal streaming pipeline sketch follows this module's outline)
- Architecting BigQuery streaming pipelines
- Bigtable
- Describe the big picture of data movement and interaction
- Establish a streaming pipeline from Dataflow to Bigtable
- Analyze the Bigtable continuous data stream for trends using BigQuery
- Synchronize the trends analysis back into the user-facing application
- Lab: Stream data with pipelines - Esports use case (optional)
- Quiz
- Lab: Use Apache Beam and Bigtable to enrich esports downloadable content (DLC) data
- Quiz
- Lab: Stream e-sports data with Pub/Sub and BigQuery
- Quiz
- Lab: Monitor e-sports chat with Streamlit
- Quiz
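To tie the streaming topics above together, here is a minimal Apache Beam streaming sketch that reads JSON messages from a Pub/Sub subscription and streams them into BigQuery, the Pub/Sub-to-BigQuery pattern referenced earlier; the project, subscription, table, and field names are placeholders.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # streaming=True keeps the pipeline running on the unbounded Pub/Sub source.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/example-project/subscriptions/match-events-sub"
            )
            | "DecodeJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:example_dataset.match_events",
                schema="player_id:STRING,event:STRING,event_ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```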
- What you’ve accomplished
- Next steps