Course 975:
SRE Essentials

(2 or 3 days)

Course Description

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations teams and includes mission-critical production support. This course starts with an introduction to the main practices of SRE and the role IT and business leaders play in successful SRE adoption. It then shows participants how Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are used to measure a service’s reliability, and attendees gain hands-on experience creating these measures. These concepts help create a culture where the reliability and success of a service can be objectively measured.

Learning Objectives

  • Articulate the technical and cultural fundamentals of Site Reliability Engineering (SRE) and understand the value they can provide to your IT operations in any environment
  • Learn SRE terminology
  • Understand why services need Service Level Objectives (SLOs)
  • Achieve developer and operations harmony with error budgets
  • Choose appropriate Service Level Indicators (SLIs) based on user journeys
  • Create specific, measurable, achievable, relevant, and time-bound SLOs
  • Leverage automated tools to reduce toil for development, testing, and operations
  • Apply tips for alerting on SLOs
  • Leverage postmortems to learn from failure

Who Should Attend

This course is aimed at development and operations engineers and technical managers. It can also be useful for product and business leaders who want to learn more about SRE.

Activities

This course contains a mixture of design and hands-on activities that give participants practical experience creating SRE measures.

  • Improving IT Operations
  • Moving to an SRE Culture
  • DORA DevOps Quick Check
  • SLI Failure Gaps
  • Define SLI and SLO Targets
  • Creating an Automated Build
  • Adding Tests to a Build
  • Deploying App Versions
  • Adding Monitoring to an Application
  • Creating a Postmortem

Course Outline

1. Introduction to SRE

  • Production Systems
  • Reduce Organizational Silos
  • Improving IT Operations
  • DevOps and SRE
  • Review the Relationship Between DevOps and SRE
  • Moving to an SRE Culture
  • Apply SRE in Your Organization

2. The Art of SLOs

  • Review SRE Terminology
  • Recognize Why Services Need SLOs
  • Incentivizing Reliability Across DevOps Teams with Solid SRE Practices
  • Choose Appropriate SLIs Based on User Journeys
  • Create SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) SLOs
  • Calculating and Leveraging Error Budgets

3. Eliminating Toil

  • Define Toil
  • Recognize Toil for Developers, Testing, and Operations
  • Leverage Automated Tools, Builds, and Testing to Eliminate Toil
  • Manage and Monitor Services and Versions Deployed to Production

4. Monitoring

  • Define Monitoring Terminology
  • Set Expectations for Monitoring
  • Review Tips for Alerting on SLOs

5. Learning from Failure

  • Leverage Postmortems to Learn from Failure
  • Review Postmortem Checklists
  • Begin Writing a Postmortem for an Incident

6. Capstone Project (optional third day of training)

  • Identify SLIs, SLOs, and Error Budgets
  • Identify Measurement Strategies
  • Debug, Troubleshoot, and Diagnose
  • Postmortem
  • Identify Improvements to the Application Deployment

Please Contact Your ROI Representative to Discuss Course Tailoring!