SRE Foundations
Cloud Computing
On-Site, Virtual
1 day
Site reliability engineering (SRE) is a software engineering approach to IT infrastructure and operations that aligns incentives between development and operations and includes mission-critical production support. This course starts with an introduction to the main practices of SRE and the role IT and business leaders play in the success of SRE adoption. The course then introduces participants to how Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets should be used to measure a service’s reliability. Attendees will gain experience with creating these measures in practice. These concepts help create a culture where the reliability and success of a service can be objectively measured.
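As an illustration of how these measures fit together, the following is a minimal sketch of the arithmetic behind an availability SLI and its error budget. It is not part of the course material. The function names, the request counts, and the 99.9% target are assumptions chosen for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    The budget is the share of events allowed to fail under the SLO,
    i.e. (1 - slo_target) * total_events. A negative result means the
    budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 requests in the window, 500 of them failed.
    total, good = 1_000_000, 999_500
    slo = 0.999  # assumed target: 99.9% of requests succeed

    print(f"SLI: {availability_sli(good, total):.2%}")
    print(f"Error budget remaining: {error_budget_remaining(slo, good, total):.1%}")
```

With these assumed figures, the SLO allows about 1,000 failed requests in the window, so 500 failures leave roughly half of the error budget unspent.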
Learning Objectives
- Articulate the technical and cultural fundamentals of Site Reliability Engineering (SRE) and understand the value they can provide to your IT operations in any environment
- Learn SRE terminology
- Understand why services need Service Level Objectives (SLOs)
- Achieve developer and operations harmony with error budgets
- Choose appropriate Service Level Indicators (SLIs) based on user journeys
- Create specific, measurable, achievable, relevant, and time-bound SLOs
- Explore techniques to reduce toil for development, testing, and operations
- Review containerization and microservice architecture
Who Should Attend
This course is aimed at development and operations engineers and technical managers. It can also be useful for product and business leaders who want to learn what SRE is.
Activities
This course contains several design activities that give participants hands-on practice creating SLIs, SLOs, and error budgets.
- Improving IT Operations
- Moving to an SRE Culture
- DORA DevOps Quick Check
- SLI Failure Gaps
- Define SLI and SLO Targets
Course Outline
- Production Systems
- Reduce Organizational Silos
- Improving IT Operations
- What Is DevOps? DevSecOps? SRE?
- Review the Relationship Between DevOps and SRE
- Shift Left on Security
- Moving to an SRE Culture
- Apply SRE in Your Organization
- Review SRE Terminology
- SLIs, SLOs, SLAs, Error Budgets
- Recognize Why Services Need SLOs
- Incentivizing Reliability Across DevOps Teams with Solid SRE Practices
- Choose Appropriate SLIs Based on User Journeys
- Create SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) SLOs
- Calculating and Leveraging Error Budgets
- Define Toil
- Recognize Toil for Developers, Testing, and Operations
- Leverage Automated Tools, Builds, and Testing to Eliminate Toil
- Reducing Developer Toil with Source Code Management (SCM)
- Reducing Operations Toil with Infrastructure as Code (IaC)
- Containerization vs. Virtual Machines
- Advantages of Containers
- Building Container Images: Docker
- Monolithic vs. Microservice Applications
- Understanding Container Orchestration: Kubernetes