• ROI Training

SRE Practitioner

Curriculum

Cloud Computing

Delivery methods

On-Site, Virtual

Duration

1 day

Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations and includes mission-critical production support. This one-day course prepares participants for the DevOps Institute SRE Practitioner certification by covering core principles such as SLOs, error budgets, incident response, and reliability engineering. The course integrates modern topics including platform engineering, security, AIOps, and GenAI, with practical exercises and real-world scenarios.

Learning Objectives

  • Understand and apply core SRE principles, including SLOs, SLIs, and error budgets
  • Identify and avoid common anti-patterns in site reliability engineering
  • Design observability strategies that support proactive incident detection and resolution
  • Create an effective alerting strategy with error budget burn rates
  • Practice blameless incident response and create effective postmortems
  • Integrate security considerations into reliability, automation, and monitoring workflows
  • Leverage Chaos Engineering to find and mitigate reliability issues
  • Apply strategies to reduce toil: what to automate, when, and how
  • Explore practical uses of GenAI for SRE tasks such as postmortem generation and knowledge sharing

Who Should Attend

This course is designed to help participants prepare for the DevOps Institute SRE Practitioner certification. It is aimed at development and operations engineers, platform engineers, and technical managers, and can also be useful for product and business leaders looking to formalize and advance their SRE expertise. Prior SRE knowledge is assumed at the level of course 975J: SRE Foundations.

Activities

This course includes several design activities (group or individual) that provide hands-on experience creating SRE practices. These activities reinforce core SRE Practitioner principles.

  • Identify SRE anti-patterns in scenarios
  • Define SLOs from sample SLIs
  • Define actionable alerts
  • Apply Chaos Engineering strategies
  • Write a postmortem summary from a mock incident

Course Outline

  • Review SRE Terminology
  • Transforming operations to an SRE culture
  • Applying SRE principles in your organization
  • SRE Anti-patterns: over-automation, tool obsession, ignoring toil
  • Activity: Identify anti-patterns in scenarios
  • Importance of user-focused reliability
  • Defining SLIs, SLOs, SLAs, and Error Budgets for customer happiness
  • Recognize why services need SLOs
  • Error budget policies and governance
  • How security incidents (e.g., DDoS, data leaks) affect availability and error budgets
  • Activity: Define SLOs from sample SLIs
  • Quiz
  • Monitoring
  • Creating an effective alerting strategy with burn rates
  • Discover Observability-Driven Development (ODD)
  • Compare observability vs. monitoring
  • MELT model
  • Activity: Define actionable alerts
  • Quiz
  • Chaos engineering 
  • Disaster Recovery (DR) Testing
  • Game Day Experiments
  • Activity: Apply Chaos Engineering and Game Day strategies
  • Quiz
  • Incident lifecycle: Detection, Response, Remediation
  • On-call practices, toil management
  • Blameless postmortems and continuous improvement
  • Activity: Write a postmortem summary from a mock incident
  • Quiz
  • Platforms for secure and reliable systems
  • Using AIOps with observability and incident response
  • Where can GenAI and AIOps fit in your SRE practices?
  • Intelligent alert summarization
  • Postmortem drafts
  • Developer support and SLO diagnostics
  • Challenges: hallucination, trust, governance
  • Quiz
