CatOps posts from 15th to 28th of August
Some articles were posted more than two weeks ago, but since this is the first issue of the newsletter, I think it’s appropriate to include them as well.
Master Azure bundle by Pluralsight - time sensitive - a Humble Bundle collection of online courses on Microsoft Azure by Pluralsight. The bundle is still active for another 14 days (as of the time this letter is sent).
How to handle Kubernetes health checks - lessons about Kubernetes probes, learned the hard way by DoorDash (food delivery service). View in Telegram.
Microsoft’s view on the security in the next-gen communication networks - Microsoft shares its thoughts on zero-trust networks, mostly with regard to telecommunications. View in Telegram.
Learn Postgres at the Playground - Free learning materials for PostgreSQL in the form of interactive labs by Crunchy. View in Telegram.
Rego: getting started - A getting started guide for Rego, the policy language used by Open Policy Agent, a popular framework for writing policies as code. View in Telegram.
Random Thoughts
I’ve started reading the book Implementing Service Level Objectives by Alex Hidalgo. So far, I have only finished the first few chapters, which focus on the theoretical aspects of SLIs, SLOs, SLAs, and error budgets - I don’t see the point of repeating those here. However, here’s a highlight that really caught my eye:
A common misconception is that you can just make SLOs an Objective and Key Result (OKR) for your quarterly roadmap and somehow end up at the other end being “done” in some sense. This is not at all how SLO-based approaches to reliability work.
That’s quite funny, because we literally have an OKR “Set SLOs for our sub-systems” at the moment :D
That said, I’d say it’s OK to have an OKR as a starting point. The main thing is that you don’t treat SLOs as “done” in the end - this is an ever-evolving process.
Another interesting point is that you have to think about your SLIs from the user’s perspective. You can start with a straightforward metric like “are my replicas up?”, but what does it tell you about your user’s experience? Do they still get a satisfying quality of service when all the replicas are up & running?
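To make that a bit more concrete, here’s a minimal sketch in Python of what a user-facing SLI could look like. It is my own illustration, not something taken from the book: the Request type, the 300 ms latency threshold, and the 0.99 target are all made-up assumptions. The idea is to count the share of requests that were both successful and fast enough, instead of checking whether replicas are alive.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one user request (not from any real library).
    status: int        # HTTP status code returned to the user
    latency_ms: float  # how long the user waited for the response


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request is 'good' if it did not fail server-side and was fast enough."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def request_sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of good requests; treated as 1.0 when there was no traffic."""
    if not requests:
        return 1.0
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


# All replicas may be up while users still see errors and slow responses.
sample = [
    Request(200, 120.0),
    Request(200, 950.0),  # succeeded, but too slow to count as good
    Request(503, 40.0),   # fast, but failed
    Request(200, 80.0),
]

sli = request_sli(sample)
slo_target = 0.99  # assumed target, purely for illustration
print(f"SLI = {sli:.2f}, SLO met: {sli >= slo_target}")
```

With this sample, a “replicas up” check would be perfectly green, while the request-based SLI shows that only half of the requests gave the user a good experience.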
A few other highlights:
Service level objectives are ultimately about happier users, happier engineers, happier product teams, and a happier business. This should always be the goal — not to reach new heights of the number of nines you can append to the end of your SLO target.
SLOs are a way to gather data to help you have discussions and make decisions that will allow you to take a better approach to improving your systems. They facilitate data-driven decision making and help system designers be more effective.
It’s important to reiterate that SLOs are objectives — they are not in any way contractual agreements. You should feel free to change or update your targets as needed. Things in the world will change, and those changes may affect how your service operates.
If you violate your SLO, you generate a piece of data you use to think about the reliability of your service. If you violate an SLO over time, you have a choice about doing something about it. If you violate your SLA, you owe someone something.
Hope you’ve enjoyed the first issue of the CatOps Newsletter! If you have any feedback or ideas on how to improve the newsletter or the channel itself, let me know! You can find my contacts here.
Cheers!