Course 975: SRE Essentials

Curriculum

Cloud Computing

Delivery methods

On-Site, Virtual

Duration

2-3 days

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations teams and includes mission-critical production support. This course starts with an introduction to the main practices of SRE and the role IT and business leaders play in successful SRE adoption. It then shows participants how Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are used to measure a service’s reliability. Attendees gain hands-on experience creating these measures in practice. These concepts help create a culture where the reliability and success of a service can be objectively measured.

Learning objectives

Articulate the technical and cultural fundamentals of Site Reliability Engineering (SRE) and understand the value they can provide to your IT operations in any environment
Learn SRE terminology
Understand why services need Service Level Objectives (SLOs)
Achieve harmony between development and operations teams with error budgets
Choose appropriate Service Level Indicators (SLIs) based on user journeys
Create specific, measurable, achievable, relevant, and time-bound SLOs
Leverage automated tools to reduce toil for development, testing, and operations
Apply tips for alerting on SLOs
Leverage postmortems to learn from failure

Who should attend

This course is aimed at development and operations engineers and technical managers, but can also be useful for product and business leaders wanting to learn more about what SRE is.

Activities

This course combines design and hands-on activities that provide real-life experience creating SRE measures such as SLIs and SLOs.

Improving IT Operations
Moving to an SRE Culture
DORA DevOps Quick Check
SLI Failure Gaps
Define SLI and SLO Targets
Creating an Automated Build
Adding Tests to a Build
Deploying App Versions
Adding Monitoring to an Application
Creating a Postmortem

Course outline

1. Introduction to SRE

Production Systems
Reduce Organizational Silos
Improving IT Operations
DevOps and SRE
Review the Relationship Between DevOps and SRE
Moving to an SRE Culture
Apply SRE in Your Organization

2. The Art of SLOs

Review SRE Terminology
Recognize Why Services Need SLOs
Incentivizing Reliability Across DevOps Teams with Solid SRE Practices
Choose Appropriate SLIs Based on User Journeys
Create SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) SLOs
Calculating and Leveraging Error Budgets
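
As a rough illustration of the error-budget arithmetic this module covers, the sketch below assumes a request-based SLI, where an SLO target of 99.9% means at most 0.1% of requests in the measurement window may fail. The function names and the sample numbers are hypothetical and are not taken from the course materials.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events that may fail in the window without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% objective (assumed form).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

Under these assumptions, a 99.9% target over 1,000,000 requests allows roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget unspent. A team might use that remaining fraction to decide whether to keep shipping changes or to prioritize reliability work.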

3. Eliminating Toil

Define Toil
Recognize Toil for Developers, Testing, and Operations
Leverage Automated Tools, Builds, and Testing to Eliminate Toil
Manage and Monitor Services and Versions Deployed to Production

4. Monitoring

Define Monitoring Terminology
Set Expectations for Monitoring
Review Tips for Alerting on SLOs

5. Learning from Failure

Leverage Postmortems to Learn from Failure
Review Postmortem Checklists
Begin Writing a Postmortem for an Incident

6. Capstone Project (optional third day of training)

Identify SLIs, SLOs, and Error Budgets
Identify Measurement Strategies
Debug, Troubleshoot, and Diagnose
Postmortem
Identify Improvements to the Application Deployment
