Logging, Monitoring, and Observability in GCP

(3 days)

 

Course Description

This three-day instructor-led course teaches participants techniques for monitoring, troubleshooting, and improving infrastructure and application performance in Google Cloud Platform (GCP). Guided by the principles of Site Reliability Engineering (SRE), and using a combination of presentations, demos, hands-on labs, and real-world case studies, attendees gain experience with full-stack monitoring, real-time log management and analysis, debugging code in production, tracing application performance bottlenecks, and profiling CPU and memory usage.

Objectives

This course teaches participants the following skills:

  • Plan and implement a well-architected logging and monitoring infrastructure
  • Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • Create effective monitoring dashboards and alerts
  • Monitor, troubleshoot, and improve GCP infrastructure
  • Analyze and export GCP audit logs
  • Find production code defects, identify bottlenecks, and improve performance
  • Optimize monitoring costs

Audience

This class is intended for the following participants:

  • Cloud architects, administrators, and SysOps personnel
  • Cloud developers and DevOps personnel

Prerequisites

To get the most out of this course, participants should have:

  • Completion of Google Cloud Platform Fundamentals: Core Infrastructure, or equivalent experience
  • Basic scripting or coding ability
  • Proficiency with command-line tools and Linux operating system environments

Course Outline

 

Module 1: Defining a Monitoring Plan

  • Understand the four golden signals: latency, traffic, errors, and
    saturation
  • Define SLIs (measures of customer pain)
  • Define critical performance measures
  • Define SLOs and SLAs
  • Define error budgets
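
The relationship between an SLO and its error budget comes down to simple arithmetic. The following is a minimal sketch of that arithmetic, not course material: the function names, the 99.9% target, and the 30-day window are illustrative assumptions.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the fraction of events (or time) allowed to fail under an SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Return the minutes of full outage the budget permits in the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Return the unspent share of the error budget (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% availability SLO over a 30-day window.
    print(allowed_downtime_minutes(0.999, 30))
    print(budget_remaining(0.999, total_requests=1_000_000, failed_requests=250))
```

Under these assumed numbers, a 99.9% SLO over 30 days allows roughly 43 minutes of total outage, and 250 failures out of one million requests would consume about a quarter of the budget.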

Module 2: Introduction to GCP Monitoring Tools

  • Understand the purpose and capabilities of GCP operations-focused
    components (Logging, Monitoring, Error Reporting, and Service
    Monitoring)
  • Understand the purpose and capabilities of GCP application
    performance management-focused components (Debugger, Trace, and
    Profiler)

Module 3: Monitoring Critical Systems

  • Use the default dashboards appropriately
  • Build custom dashboards to show resource consumption and
    application load
  • Define uptime checks to track availability and latency

Module 4: Alerting Policies

  • Define alerting policies
  • Define alerts based on policy violations
  • Optimize alerts for actionability
  • Know the types of alerts and common uses for each
  • Implement best practices for alerting policies
  • Define and alert on resource groups
  • Manage alerting policies programmatically using the GCP Monitoring API
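
To give a feel for managing an alerting policy as data, the sketch below builds the JSON body for a single threshold condition. It is an illustration under stated assumptions: the field names are intended to follow the AlertPolicy resource of the Cloud Monitoring API v3 and should be checked against the current API reference, while the function name, display names, 80% threshold, and 300-second duration are hypothetical. The code only constructs and prints the document; it makes no API call.

```python
import json


def cpu_alert_policy(threshold: float, duration_seconds: int) -> dict:
    """Build an alerting-policy body with one CPU-utilization condition.

    Assumption: the keys mirror the Monitoring API v3 AlertPolicy resource.
    """
    return {
        "displayName": "High CPU utilization (example)",
        "combiner": "OR",  # the policy fires if any condition is met
        "conditions": [
            {
                "displayName": (
                    f"CPU above {threshold:.0%} for {duration_seconds}s"
                ),
                "conditionThreshold": {
                    "filter": (
                        'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                        'AND resource.type="gce_instance"'
                    ),
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    # The condition must hold this long before it triggers.
                    "duration": f"{duration_seconds}s",
                    "aggregations": [
                        {
                            "alignmentPeriod": "60s",
                            "perSeriesAligner": "ALIGN_MEAN",
                        }
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical values: alert when mean CPU exceeds 80% for five minutes.
    print(json.dumps(cpu_alert_policy(0.8, 300), indent=2))
```

Keeping the policy as a plain data structure makes it easy to store in version control and review before it is submitted through whichever client or tool is in use.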

Module 5: Configuring GCP Services for Observability

  • Define the monitoring project architecture in accordance with best practices
  • Define Cloud IAM roles for monitoring
  • Define labels and tags for resources
  • Bake agents into VM images for app visibility in Compute Engine
  • Install Kubernetes Monitoring
  • Expose app data for Kubernetes Engine apps using Prometheus and OpenCensus

Module 6: Advanced Logging and Analysis

  • Know and choose among resource tagging approaches
  • Connect application errors to Logging using Error Reporting
  • Define log sinks (inclusion filters) and exclusion filters; understand the batch versus real-time nature of data availability in log sinks
  • Create metrics based on logs
  • Define custom metrics
  • Export logs to BigQuery
  • Analyze logs using BigQuery
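
The idea behind a counter-style metric based on logs can be sketched locally: select the entries that match a condition and count them per time bucket. The example below is a conceptual illustration only and does not use the Cloud Logging API. The function name and the sample entries are hypothetical; the severity names follow the standard Cloud Logging severity levels.

```python
from collections import Counter

# Cloud Logging severity levels, ordered from least to most severe.
SEVERITY_ORDER = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def count_per_minute(entries, min_severity="ERROR"):
    """Count entries at or above min_severity, bucketed by minute.

    Each entry is a dict with an RFC 3339 "timestamp" string and an
    optional "severity" string (missing severity is treated as DEFAULT).
    """
    threshold = SEVERITY_ORDER.index(min_severity)
    counts = Counter()
    for entry in entries:
        severity = entry.get("severity", "DEFAULT")
        if SEVERITY_ORDER.index(severity) >= threshold:
            # The first 16 characters are "YYYY-MM-DDTHH:MM".
            counts[entry["timestamp"][:16]] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    sample = [
        {"timestamp": "2024-01-01T12:00:05Z", "severity": "ERROR"},
        {"timestamp": "2024-01-01T12:00:40Z", "severity": "INFO"},
        {"timestamp": "2024-01-01T12:00:59Z", "severity": "CRITICAL"},
        {"timestamp": "2024-01-01T12:01:10Z", "severity": "ERROR"},
    ]
    print(count_per_minute(sample))
```

A managed log-based metric applies the same select-and-count idea continuously on the service side, so the resulting time series can feed dashboards and alerting policies.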

Module 7: Analyzing GCP Audit Logs

  • Use Admin Activity audit logs to track changes to the configuration or
    metadata of resources
  • Use Data Access audit logs to track accesses or changes to
    user-provided resource data
  • Use System Event audit logs to track GCP administrative actions
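
Each of the three audit log types above is written to its own log, so a log filter can select one type by log name. The sketch below builds such a filter string. The function name, dictionary keys, and the project ID `my-project` are hypothetical; the log IDs reflect the documented `cloudaudit.googleapis.com` naming and are worth confirming against the current audit logging documentation.

```python
from urllib.parse import quote

# Log IDs for the three audit log types covered in this module.
AUDIT_LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com/activity",
    "data_access": "cloudaudit.googleapis.com/data_access",
    "system_event": "cloudaudit.googleapis.com/system_event",
}


def audit_log_filter(project_id: str, kind: str) -> str:
    """Return a log filter that selects one audit log type in a project."""
    if kind not in AUDIT_LOG_IDS:
        raise ValueError(f"unknown audit log kind: {kind!r}")
    # The log ID must be URL-encoded inside the log name ("/" becomes "%2F").
    encoded_id = quote(AUDIT_LOG_IDS[kind], safe="")
    return f'logName="projects/{project_id}/logs/{encoded_id}"'


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    for kind in AUDIT_LOG_IDS:
        print(audit_log_filter("my-project", kind))
```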

Module 8: Managing Incidents

  • Define incident management roles and communication channels
  • Mitigate incident impact
  • Troubleshoot root causes
  • Resolve incidents
  • Document incidents in a post-mortem process

Module 9: Investigating Application Performance Issues

  • Use Debugger to identify code defects in production
  • Use Trace to find performance bottlenecks in production
  • Use Profiler to find resource-intensive functions in an application

Module 10: Optimizing the Costs of Monitoring

  • Understand the billing of monitoring components within GCP
  • Analyze the resource utilization of monitoring components within GCP
  • Implement best practices for controlling the cost of monitoring within GCP