Course 975J: SRE Foundations

Curriculum

Cloud Computing

Delivery methods

On-Site, Virtual

Duration

1 day

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations and includes mission-critical production support. This course starts with an introduction to the main practices of SRE and the role IT and business leaders play in the success of SRE adoption. The course then introduces participants to how Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets should be used to measure a service’s reliability. Attendees will gain experience with creating these measures in practice. These concepts help create a culture where the reliability and success of a service can be objectively measured.

Learning Objectives

Articulate the technical and cultural fundamentals of Site Reliability Engineering (SRE) and understand the value they can provide to your IT operations in any environment
Learn SRE terminology
Understand why services need Service Level Objectives (SLOs)
Achieve developer and operations harmony with error budgets
Choose appropriate Service Level Indicators (SLIs) based on user journeys
Create specific, measurable, achievable, relevant, and time-bound SLOs
Explore techniques to reduce toil for development, testing, and operations
Review containerization and microservice architecture

Who Should Attend

This course is aimed at development and operations engineers and technical managers, but can also be useful for product and business leaders wanting to learn more about what SRE is.

Activities

This course includes several design activities that give participants hands-on experience creating SRE measures such as SLIs, SLOs, and error budgets.

Improving IT Operations
Moving to an SRE Culture
DORA DevOps Quick Check
SLI Failure Gaps
Define SLI and SLO Targets

Course outline

1. Introduction to SRE

Production Systems
Reduce Organizational Silos
Improving IT Operations
What Is DevOps? DevSecOps? SRE?
Review the Relationship Between DevOps and SRE
Shift Left on Security
Moving to an SRE Culture
Apply SRE in Your Organization

2. The Art of SLOs

Review SRE Terminology
SLIs, SLOs, SLAs, Error Budgets
Recognize Why Services Need SLOs
Incentivizing Reliability Across DevOps Teams with Solid SRE Practices
Choose Appropriate SLIs Based on User Journeys
Create SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) SLOs
Calculating and Leveraging Error Budgets
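
As a rough illustration of the error-budget arithmetic named in this module (a sketch only, not taken from the course materials; the function names, the 99.9% target, and the 30-day window are assumptions chosen for the example):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO (hypothetical value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO does not promise.
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative number when the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes of budget")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget remaining")
```

Under these assumptions, a 99.9% SLO over 30 days leaves about 43.2 minutes of allowed downtime, and 250 failures out of 1,000,000 requests consume a quarter of the request-based budget.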

3. Eliminating Toil

Define Toil
Recognize Toil for Developers, Testing, and Operations
Leverage Automated Tools, Builds, and Testing to Eliminate Toil
Reducing Developer Toil with Source Code Management (SCM)
Reducing Operations Toil with Infrastructure as Code (IaC)

4. Containerization

Containerization vs. Virtual Machines
Advantages of Containers
Building Container Images: Docker
Monolithic vs. Microservice Applications
Understanding Container Orchestration: Kubernetes
