Course 975:
SRE Essentials

(2 days)

 

Course Description

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations teams and includes mission-critical production support. This workshop starts with an introduction to the main practices of SRE and the role IT and business leaders play in successful SRE adoption. The course then introduces participants to how Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are used to measure a service’s reliability. Attendees gain hands-on experience creating these measures. Together, these concepts help create a culture in which the reliability and success of a service can be measured objectively.

Learning Objectives

  • Articulate the technical and cultural fundamentals of Site Reliability Engineering (SRE) and understand the value they can provide to your IT operations in any environment
  • Learn SRE terminology
  • Understand why services need Service Level Objectives (SLOs)
  • Achieve harmony between development and operations with error budgets
  • Choose appropriate Service Level Indicators (SLIs) based on user journeys
  • Create specific, measurable, achievable, relevant, and time-bound SLOs
  • Leverage automated tools to reduce toil for development, testing, and operations
  • Apply tips for alerting on SLOs
  • Leverage postmortems to learn from failure

Who Should Attend

This workshop is aimed at development and operations engineers and technical managers, but it can also be useful for product and business leaders who want to learn what SRE is.


Course Outline

 

Unit 1: SRE Cultural Practices

  • Production Systems
  • Reduce Organizational Silos
  • Improving IT Operations
  • DevOps and SRE
  • Review the Relationship Between DevOps and SRE
  • Moving to an SRE Culture
  • Apply SRE in Your Organization

Unit 2: The Art of SLOs

  • Review SRE Terminology
  • Recognize Why Services Need SLOs
  • Incentivizing Reliability Across DevOps Teams with Solid SRE Practices
  • Choose Appropriate SLIs Based on User Journeys
  • Create Specific, Measurable, Achievable, Relevant, and Time-bound SLOs
  • Calculating and Leveraging Error Budgets

Unit 3: Eliminating Toil

  • Define Toil
  • Recognize Toil for Developers, Testing, and Operations
  • Leverage Automated Tools, Builds, and Testing to Eliminate Toil
  • Manage and Monitor Services and Versions Deployed to Production

Unit 4: Monitoring

  • Define Monitoring Terminology
  • Set Expectations for Monitoring
  • Review Tips for Alerting on SLOs

Unit 5: Learning from Failure

  • Leverage Postmortems to Learn from Failure
  • Review Postmortem Checklists
  • Begin Writing a Postmortem for an Incident

 

Please Contact Your ROI Representative to Discuss Course Tailoring!