WRITELOOP

SRE 101

2022 March 9
  • What is the problem with the traditional Dev and Ops roles? Devs want to deploy new features FAST, while Ops wants to maintain stability, so there is a conflict of interest. Operations also tends to introduce bureaucracy to preserve stability, which slows down the development cycle.

  • What is the difference between SRE and DevOps?

    • The focus of DevOps is to enable a faster release process.
    • So DevOps is optimized more for speed than for stability & reliability.
  • What is SRE? Site Reliability Engineering. “SRE is the new ops team, responsible for automations to help developers to release SAFELY and FAST.”

  • How does Google (who “invented the concept”) define SRE? Google describes it roughly as: “SRE is what happens when you treat operations as a software problem and you throw a bunch of software engineers to solve that problem.”

  • What is the composition of an SRE team? An SRE team is made up of software engineers who build and implement software to improve the reliability of their systems/services.

  • What is a “system” in the context of SRE? The whole deployment environment: servers, cloud & virtual machines, applications and services, databases, networks, etc.

  • What is “Reliability” in SRE? The property of a system that is rarely inaccessible.

  • Why is Reliability important? An unreliable system can reduce a company’s revenue and negatively impact its users.

  • What makes a system unreliable? Changes to infrastructure, platform, services and applications.

  • Why should changes NOT be limited? Because changes are needed to:
    • Make the app better
    • Increase business value
    • Stay competitive

  • What makes SRE a better solution? It tries to automate the evaluation of the effects a change will have. That allows changes to be made fast AND safely.

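  • How is the automated evaluation of changes done? Using SLAs (Service Level Agreements). An SLA is a percentage metric that shows how reliable a system is going to be for its end users. (Strictly speaking, the SLA is the commitment made to users; the internal target behind it is usually called an SLO, Service Level Objective.)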

  • Is 100% availability an achievable goal? In theory yes, but it is very difficult and expensive. Very few services need a 100% SLA. E.g., a cellular network operator (Claro, Vivo, etc.) might only guarantee a 99% SLA.
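
To make the cost of each extra "nine" concrete, here is a minimal Python sketch that converts an SLA percentage into the downtime it tolerates. The function name `allowed_downtime` and the 365-day measurement period are assumptions made for illustration; they do not come from the original material.

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Return how long a service may be down during `period` while still meeting the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # The tolerated downtime is the fraction of the period NOT covered by the SLA.
    return period * ((100 - sla_percent) / 100)


if __name__ == "__main__":
    year = timedelta(days=365)  # assumption: measure over one non-leap year
    for target in (99.0, 99.9, 99.99, 100.0):
        print(f"{target}% SLA allows {allowed_downtime(target, year)} of downtime per year")
```

Under these assumptions a 99% SLA tolerates about 3.65 days of downtime per year, 99.9% about 8.76 hours, and 100% none at all, which is why that last step is so expensive.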

  • What kinds of SLA can you have?
    • Availability (the service being up and running)
    • Response Time
    • Error Rate
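
As a rough illustration of how the three kinds of metric above could be measured, the sketch below computes each one from recorded data. The `Request` record, the use of health-check probes for availability, the "5xx means error" rule and the nearest-rank percentile are all assumptions chosen for the example, not a prescribed method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to answer, in milliseconds


def availability(probe_results: list[bool]) -> float:
    """Percentage of health-check probes that found the service up (non-empty input assumed)."""
    return 100 * sum(probe_results) / len(probe_results)


def error_rate(requests: list[Request]) -> float:
    """Percentage of requests answered with a server-side (5xx) error (non-empty input assumed)."""
    failed = sum(1 for r in requests if r.status >= 500)
    return 100 * failed / len(requests)


def response_time_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of response times, e.g. pct=95 for the p95 latency."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Each function returns a single number that can be compared against the corresponding target, for example "error rate below 1%" or "p95 response time below 300 ms" (both thresholds are made up here).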

  • Who defines SLAs? Business people (since SLAs impact end users) together with engineers. Business people define SLAs at a high level; engineers define them at a technical level.

  • How can you get an idea of the right SLA?
    • Industry benchmarks
    • User feedback
    • Competition

  • What is an Error Budget in SRE? The downtime a service is allowed to have, i.e. 100% minus the SLA. The team can spend the error budget on making risky changes.
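
The arithmetic behind an error budget can be sketched as follows. The function names and the 30-day window used in the comments are hypothetical; the only idea taken from the notes is that the budget is the downtime the SLA still allows.

```python
def error_budget_minutes(sla_percent: float, period_minutes: float) -> float:
    """Total downtime the SLA tolerates over the period: (100% - SLA) of the period."""
    return period_minutes * (100 - sla_percent) / 100


def remaining_budget_minutes(
    sla_percent: float, period_minutes: float, outages_minutes: list[float]
) -> float:
    """Budget left after subtracting every outage; a negative value means the SLA was missed."""
    return error_budget_minutes(sla_percent, period_minutes) - sum(outages_minutes)


if __name__ == "__main__":
    thirty_days = 30 * 24 * 60  # assumption: a 30-day window, i.e. 43,200 minutes
    budget = error_budget_minutes(99.9, thirty_days)
    left = remaining_budget_minutes(99.9, thirty_days, [10.0, 12.5])
    print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

With a 99.9% SLA over 30 days the budget is about 43.2 minutes, so two outages of 10 and 12.5 minutes leave roughly 20.7 minutes to spend on risky changes.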

  • What to do if the measured service level is below the SLA? Move more engineers from “Software Development” to “Operation Tasks”. Fewer changes (releases) will be allowed.

  • What to do if the measured service level is above the SLA? Move more engineers from “Operation Tasks” to “Software Development”. Developers can then make more changes (releases).
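
The two rules above amount to a simple decision that could be automated. The sketch below is one possible encoding; the `Focus` enum, the function name and the choice to treat "exactly at the SLA" as healthy are assumptions, since the notes only describe the "below" and "above" cases.

```python
from enum import Enum


class Focus(Enum):
    OPERATIONS = "move engineers to operation tasks and allow fewer releases"
    DEVELOPMENT = "move engineers to software development and allow more releases"


def next_focus(measured_percent: float, sla_percent: float) -> Focus:
    """Decide where engineering effort should go, given the measured level and the SLA."""
    # Assumption: exactly meeting the SLA counts as healthy.
    if measured_percent < sla_percent:
        return Focus.OPERATIONS
    return Focus.DEVELOPMENT


if __name__ == "__main__":
    print(next_focus(99.5, 99.9).value)
    print(next_focus(99.95, 99.9).value)
```

A real policy would probably look at the remaining error budget over a time window rather than a single measurement, but the decision has the same shape.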

  • What does an SRE do?
    • Automation: evaluate whether the services are within the SLA (this eliminates manual bureaucracy)
    • Observability: configure monitoring and logging (to measure uptime)

  • How does SRE prepare for outages?
    • Detect issues before they happen, or early when they do: monitoring, logging & alerting.
    • Alerts must notify the correct person and include all the information needed for a fast diagnosis. E.g., a bad alert: “something is wrong on the cluster”; a good alert: “service A in cluster B is throwing 500 errors”.
    • “On-call support”: an SRE is assigned to work with the Support Team.
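
One way to enforce the "good alert" shape described above is to make the alert a structured record that cannot be created without the key details. This is only a sketch: the `Alert` class, its field names and the message format are invented for illustration and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # which service is failing, e.g. "A"
    cluster: str   # where it is failing, e.g. "B"
    symptom: str   # what is observed, e.g. "throwing 500 errors"
    on_call: str   # who must be notified (hypothetical handle)

    def __post_init__(self) -> None:
        # Reject vague alerts: every field must carry actual information.
        for name in ("service", "cluster", "symptom", "on_call"):
            if not getattr(self, name).strip():
                raise ValueError(f"alert is missing '{name}'")

    def message(self) -> str:
        return (
            f"@{self.on_call}: service {self.service} "
            f"in cluster {self.cluster} is {self.symptom}"
        )


if __name__ == "__main__":
    print(Alert("A", "B", "throwing 500 errors", "sre-on-call").message())
```

An alert like “something is wrong on the cluster” cannot be expressed with this record unless someone fills in the service and the symptom, which is the point of the exercise.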

  • What are the benefits of having part of the SRE team work with the Support Team (e.g. occasionally)?
    • Know what issues to expect
    • Know how the support team deals with issues
    • Discover what improvements can be made

  • What is the main goal of SRE? Small scope of incidents:
    • Short duration of outages
    • Few people affected
    • Few services affected

  • What should be done after an outage is fixed, to ensure we learn something from it? A postmortem: a post-outage analysis.

  • How to conduct a postmortem?
    • Thorough analysis: What systems were affected? What caused the outage? Who did what? Who fixed what?
    • Stay blameless (do NOT single people out for their mistakes)
    • Document everything

  • Who must do SRE?
    • Ideal: a dedicated professional, working full-time to keep systems reliable. Generally you have one team of SREs + Software Developers.
    • Reality: you can have SREs doing software development as well, and SREs must know software development.

  • What are the differences between DevOps and SRE?
    • DevOps engineers do not have to know software development (?)
    • “Original DevOps Concept”: they know WHAT needs to be done. SRE: they know HOW to do what needs to be done.
    • “Practical DevOps”: more focused on speed of delivery of application changes. SRE: more focused on reliability of services.

NOTE: The original content(s) that inspired this one can be found at:
https://www.youtube.com/watch?v=OnK4IKgLl24&list=PLJI2RX4Ltq-kKEf9AfxkcynGPX-ToBy48&index=5
All copyright and intellectual property of each one belongs to its original author.