Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. Its main goal is to create scalable and highly reliable software systems. During this 8-hour course, participants will learn the fundamentals of SRE: the principles and practices that enable enterprises to scale critical services reliably and economically. They will learn what makes SRE such an important discipline when practiced correctly, and how it can improve both the stability and performance of enterprise applications.
Who should attend
- Software Engineers interested in learning how to apply SRE within an operations environment
- DevOps practitioners interested in understanding the role of SRE and how to adopt it within their own organization
Prerequisites
The course has no specific prerequisites.
What you will learn
During this course you will learn about the following topics.
Culture
- The Dev and Ops Silos
- What is SRE
- The difference between SRE and DevOps
- SRE Principles
- Getting on board with the SRE Culture
- SRE team topologies
- Expectations vs Reality
- Hiring SREs
- Educating SREs
Automation
- What is Toil
- Techniques for reducing toil
- Tips for automation
- Exploring the different levels of automation
- Automation Pitfalls
Testing
- Why it matters
- Pre-Production Testing: Unit, Integration, Load
- Production Testing: Canaries, Feature Flags and Chaos Engineering
Incident Management
- Exploring the Team Topologies
- Incident Response Protocol
- Troubleshooting Sane Practices
- Tools of the Trade
- Writing Postmortems
- Incident Management Training
Observability
- Monitoring
- Golden Signals
- Alerting
- Logging
- Tracing
- Common Pitfalls
- Sane Practices
Introduction to SLOs/SLIs
- What are Service Level Indicators (SLIs)
- What are Service Level Objectives (SLOs)
- Error Budgets
- Good Practices and Common Pitfalls
- Workshop
The Reliability aspect of SRE
- What is Reliability
- The difference between failure and fault
- Tolerating faults
- Reliability Practices
- Ensuring compliance with reliable practices (Production Readiness Reviews)
- Deployment strategies
- Benefits of cluster orchestrators
- Kubernetes
- The Operator Pattern
- Disaster Recovery
- Capacity Planning
- Drills
Introduction to Chaos Engineering
- History
- What is Chaos Engineering
- Running Chaos Engineering Experiments
- Tools of the Trade
- Common Pitfalls and Tips
- Gameday example