Course 975P: SRE Practitioner

Curriculum: Cloud Computing
Delivery methods: On-Site, Virtual
Duration: 1 day

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations and includes mission-critical production support. This one-day course prepares participants for the DevOps Institute SRE Practitioner certification by covering core principles such as SLOs, error budgets, incident response, and reliability engineering. The course integrates modern topics including platform engineering, security, AIOps, and GenAI, with practical exercises and real-world scenarios.

Learning Objectives

Understand and apply core SRE principles, including SLOs, SLIs, and error budgets
Identify and avoid common anti-patterns in site reliability engineering
Design observability strategies that support proactive incident detection and resolution
Create an effective alerting strategy with error budget burn rates
Practice blameless incident response and create effective postmortems
Integrate security considerations into reliability, automation, and monitoring workflows
Leverage Chaos Engineering to find and mitigate reliability issues
Apply strategies to reduce toil: what to automate, when, and how
Explore practical uses of GenAI for SRE tasks such as postmortem generation and knowledge sharing

Who Should Attend

This course is designed to help prepare for the DevOps Institute SRE Practitioner certification. It is aimed at development and operations engineers, platform engineers, and technical managers, but can also be useful for product and business leaders looking to formalize and advance their SRE expertise. Prior SRE knowledge is assumed at the level of course 975J: SRE Foundations.

Activities

This course contains several design activities (group or individual) that provide hands-on experience with creating the SRE practices listed below. These activities reinforce core SRE Practitioner principles.

Identify SRE anti-patterns in scenarios
Define SLOs from sample SLIs
Define actionable alerts
Apply Chaos Engineering strategies
Write a postmortem summary from a mock incident

Module 1: SRE Principles and Anti-Patterns

Review SRE Terminology
Transforming operations to an SRE Culture
Apply SRE principles in Your Organization
SRE Anti-patterns: over-automation, tool obsession, ignoring toil
Activity: Identify anti-patterns in scenarios

Module 2: Service Level Objectives (SLOs) and Error Budgets

Importance of user-focused reliability
Defining SLIs, SLOs, SLAs, and Error Budgets for customer happiness
Recognize why services need SLOs
Error budget policies and governance
How security incidents (e.g., DDoS attacks, data leaks) affect availability and error budgets
Activity: Define SLOs from sample SLIs
Quiz
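
To make the error budget idea from this module concrete, here is a minimal sketch of the arithmetic, assuming an availability-style SLO expressed as a fraction (for example 0.999) and a 30-day window. The function names and the 30-day default are illustrative assumptions, not part of the course materials.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for an availability SLO over the window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of unavailability.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failed requests out of 1,000,000 leaves about 75% of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

An error budget policy would typically attach actions to thresholds of the remaining fraction; which thresholds to use is an organizational decision that this sketch does not prescribe.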

Module 3: Monitoring and Observability

Monitoring
Creating an effective alerting strategy with burn rates
Discover Observability-Driven Development (ODD)
Compare observability vs. monitoring
MELT model
Activity: Defining actionable alerts
Quiz
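
As an illustration of alerting on burn rates, the sketch below computes a burn rate as the observed error ratio divided by the error ratio the SLO allows, and pages only when both a short and a long window exceed a threshold. The function names, the two-window design, and the default threshold of 14.4 (the rate at which 2% of a 30-day budget is spent in one hour) are assumptions chosen for the example, not the course's prescribed method.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing.

    A burn rate of 1.0 spends the whole budget exactly at the end of the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return error_ratio / (1.0 - slo_target)


def should_page(short_window_ratio: float,
                long_window_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    Requiring the long window reduces noise from brief spikes; requiring the
    short window lets the alert clear soon after the problem stops.
    The 14.4 default is an illustrative assumption.
    """
    return (burn_rate(short_window_ratio, slo_target) >= threshold
            and burn_rate(long_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% availability target
    print(should_page(0.02, 0.016, slo))  # both windows well above threshold
    print(should_page(0.02, 0.005, slo))  # long window below threshold
```

In practice the error ratios would come from a monitoring system's queries over, for example, 5-minute and 1-hour windows; those window lengths are also a design choice rather than a fixed rule.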

Module 4: Chaos Engineering and Resilience Testing

Chaos engineering
Disaster Recovery (DR) Testing
Game Day Experiments
Activity: Leveraging Chaos Engineering and Game Day strategies
Quiz

Module 5: SRE Incident Response and Postmortems

Incident lifecycle: Detection, Response, Remediation
On-call practices, toil management
Blameless postmortems and continuous improvement
Activity: Write a postmortem summary from a mock incident
Quiz

Module 6: Platforms, Generative AI, and AIOps

Platforms for secure and reliable systems
Using AIOps with observability + incident response
Where GenAI and AIOps can fit in your SRE practices: intelligent alert summarization, postmortem drafts, developer support, and SLO diagnostics
Challenges: hallucination, trust, governance
Quiz
