This world is crazy. I don't understand why my client can't get that they need to adopt Observability. The company wants to do a Cloud Native transformation and start doing DevOps. Yet the first thing they plan is to move from Jenkins pipelines to GitLab CI.
Wait. What? What's the point?
You just created a bunch of work to do, with no real benefit.
How is this even close to DevOps?
I've been telling them over and over again, like a broken record: you need to start with monitoring and observability. If you want to deploy to production faster and more often, first you need to understand what the hell is going on with your systems.
First of all, if you don't understand and can't clearly see what's going on with your systems, you are not doing DevOps. Period. There is no point in even pretending that you are doing DevOps.
Look at Google's Site Reliability Engineering book. You don't need to read the whole thing; just look at the first three chapters. They start with embracing risk, then SLOs, then monitoring, all of which relate to monitoring and observability.
The SRE book gives you this hierarchy of needs for making your product reliable:
Look at this pyramid. What does everything rely on?
Reliable production systems need to have good monitoring in place. Without monitoring, you have no way to tell whether the service is even working. You’re flying blind. Maybe everyone who tries to use the website gets an error, maybe not. Who knows?
What does monitoring rely on?
Observability is about turning your blackbox applications into open, instrumented microservices that let you quickly inspect and understand what's going on. It's the ability to instantly observe how well they are running.
Honestly, before you do anything else on your transition to DevOps, you need to figure out your plan to do Observability. So let's figure it out.
How to do Observability?
Tip 1. Productionize your programming languages
Being on-call for a service written in Java is going to be a bit different than for one written in PHP or Go. A lot of it depends on the ecosystem too. You need well-built frameworks, libraries, and tools. Some frameworks and libraries are heavier than others, and some are harder to instrument and operationalize. Let's look at some questions that can point us in the right direction:
- How long does a typical app take to start?
- How much memory and CPU does the app require without any load?
- What's the max load it can handle?
- What do its resource needs look like at max load?
- How does it behave when the load goes over the top?
These are crucial questions you need to be able to answer quickly. A ready-made dashboard for this, just a click away, is what separates organizations that have a clue from ones that don't. Knowing the expected behavior is crucial to identifying and fixing problems quickly. Looking at your dashboard, you should be able to tell clearly whether the app is failing to start or is simply overloaded.
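If you run on Kubernetes, a minimal sketch of the queries behind such a dashboard might look like this (assuming cAdvisor metrics are being scraped; the `myapp` pod name is a hypothetical example):

```promql
# Memory in use per pod (working set, which is what the OOM killer looks at)
container_memory_working_set_bytes{pod=~"myapp-.*"}

# CPU usage in cores, averaged over 5 minutes
rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])
```

Graphing these at idle and at max load answers most of the questions above.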
Typically there are a lot of differences in how programming languages handle memory and concurrency. Go provides lightweight threads (goroutines) and garbage collection. Python has a global interpreter lock. Java has virtualized everything with the JVM. PHP relies on the web server to do most of the work. Some of the important questions in this space are:
- What are the key indicators that show you have a memory leak?
- Does the language have garbage collection?
- How does concurrency work in that language?
- Are any of your app's threads or goroutines leaking?
It should be relatively easy for you to identify GC problems or a memory leak, simply by looking at your premade dashboard and seeing a graph spike.
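For a Go service instrumented with the official Prometheus client library, a leak-hunting sketch might watch these default metrics (names differ per language and client):

```promql
# Heap that keeps climbing under steady traffic suggests a memory leak
go_memstats_heap_inuse_bytes

# Goroutine count should plateau; unbounded growth means a goroutine leak
go_goroutines

# GC time spent per second; sustained spikes point at GC pressure
rate(go_gc_duration_seconds_sum[5m])
```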
Tip 2. Alert on most important service metrics
Tracking customer-facing application metrics is a must! There are a couple of patterns you should follow. One example is the RED method: you track request rate (i.e. HTTP requests per second), error rate (5xx responses per second), and duration. For duration, you typically want the 50th percentile to be less than X milliseconds and the 99th percentile less than Y milliseconds.
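Assuming your service exposes a standard `http_request_duration_seconds` histogram with a `code` label (metric and label names vary by framework, so treat these as a sketch), the RED metrics in PromQL might be:

```promql
# Rate: requests per second
sum(rate(http_request_duration_seconds_count[5m]))

# Errors: 5xx responses per second
sum(rate(http_request_duration_seconds_count{code=~"5.."}[5m]))

# Duration: 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```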
Another example comes from Google: in their SRE book, they talk about the Four Golden Signals:
- Latency - how long it takes for the service to respond to a request.
- Traffic - the number of requests that your service is currently handling.
- Errors - the rate of requests that fail.
- Saturation - how full your service is; how many more requests it can handle before breaking.
Your application doesn't have to expose Prometheus or Graphite metrics directly; you can pretty quickly calculate RED method metrics from logs. But you need to make sure that the logs are appropriately structured and that all the relevant information is there.
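As a sketch of deriving RED metrics from structured logs (the JSON field names `status` and `duration_ms` are assumptions; adjust them to your log format):

```python
import json
import statistics

def red_from_logs(lines, window_seconds=60):
    """Derive RED metrics from JSON access-log lines collected over a window.
    Each line is expected to carry `status` and `duration_ms` fields."""
    entries = [json.loads(line) for line in lines]
    durations = [e["duration_ms"] for e in entries]
    cuts = statistics.quantiles(durations, n=100)  # 99 percentile cut points
    return {
        "rate_rps": len(entries) / window_seconds,
        "error_rps": sum(e["status"] >= 500 for e in entries) / window_seconds,
        "p50_ms": cuts[49],   # 50th percentile
        "p99_ms": cuts[98],   # 99th percentile
    }
```

In production you would stream logs through something like mtail or a log pipeline instead of batching them, but the arithmetic is the same.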
The critical bit here is to alert on the most important service metrics. These metrics typically give you a quick way to know when your customer is experiencing issues. You can also do SRE style alerts by figuring out your error budgets and writing down "contractual" style Service Level Objectives.
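As a quick sketch of the error-budget arithmetic behind such an SLO (the 99.9% target and 30-day window are just examples):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime in the window for an availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime budget;
# alert when you are burning through that budget too fast.
```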
Similarly, you need to ensure that you have the most critical metrics for your databases/queues and other stateful services. So many outages happen because MySQL goes wonky and nobody notices.
You definitely want to know that before your customers do.
Tip 3. Add some blackbox monitoring into the mix
Getting usable service metrics can sometimes be tricky. For example, say you are hosting an FTP server. Most open-source FTP servers were written well before Prometheus was a thing, so they don't expose any metrics. The most straightforward way to monitor in this case is the blackbox monitoring approach: run an app that constantly connects to your application and imitates some user behavior. For FTP, that could be putting or reading a file; for a web app, it could be a simple HTTP request or a complicated user flow.
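With Prometheus, the usual tool for this is the blackbox_exporter. A minimal sketch of its module config might look like this (there is no built-in FTP prober, so a plain TCP connect to the FTP control port is a common stand-in):

```yaml
modules:
  http_2xx:       # probe a web app with a simple GET, expect a 2xx
    prober: http
    timeout: 5s
  tcp_connect:    # probe an FTP server by checking port 21 accepts connections
    prober: tcp
    timeout: 5s
```

For richer flows (log in, upload a file), a small custom probe script pushing its own metrics works just as well.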
This is super important because sometimes white-box metrics just don't work. For example, when a service is overloaded or stuck in a garbage collection loop, the application won't have enough resources to report its metrics.
Blackbox metrics are typically relatively easy to set up. It also makes sense to alert on them: if your application doesn't respond to blackbox probes, you definitely know something is wrong.
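A sketch of such an alert as a Prometheus rule, using the `probe_success` metric the blackbox_exporter exposes (the duration and labels are just examples):

```yaml
groups:
  - name: blackbox
    rules:
      - alert: BlackboxProbeFailed
        expr: probe_success == 0
        for: 5m               # avoid paging on a single flaky probe
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"
```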
Tip 4. Learn to query your metric database
Ad hoc queries are a lifesaver. Learning to query your metric database helps you figure out why an outage happened, what actually went on, and how to write a proper postmortem. After the postmortem, you typically want to improve your application, add the newly discovered metrics to your dashboards, and create new alerts.
One example of this was an outage caused by CPU throttling. The application was super slow and unresponsive, but we couldn't figure out why in time. After a couple of hours, it came back to life without us really doing anything. Digging deeper into the metrics, we found that the CPU limits were too low, which caused significant CPU throttling. We probably lost some customers that day, but we learned a lot.
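On Kubernetes, the ad hoc query that would have caught this uses cAdvisor's CFS throttling counters; a sketch:

```promql
# Fraction of CPU scheduling periods in which the container was throttled
rate(container_cpu_cfs_throttled_periods_total[5m])
  /
rate(container_cpu_cfs_periods_total[5m])
```

A value near 1 means the container wants far more CPU than its limit allows.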
Tip 5. Invest in tracing
There is nothing else like seeing a request flow end to end and seeing what is actually going on. You can conceptualize this in your head: I call this service, that service calls a different service, and so on, and sort of know the whole call graph. But things are always changing; what was true yesterday may not be the case today. Just because things are okay right now doesn't mean they will stay that way. That's why you should invest in tracing: you will have a fantastic tool for looking at actual end-to-end requests and their latencies.
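Under the hood, tracing mostly comes down to propagating a trace context between services. A minimal sketch of the W3C Trace Context `traceparent` header (in practice you would use a library such as OpenTelemetry rather than rolling this yourself):

```python
import secrets

def new_traceparent():
    """Start a new trace: version-traceid-spanid-flags, per W3C Trace Context."""
    trace_id = secrets.token_hex(16)  # 128-bit trace id, hex encoded
    span_id = secrets.token_hex(8)    # 64-bit span id for the root span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent):
    """Start a child span: keep the trace id, mint a fresh span id."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every service forwards this header downstream and reports its spans to a collector, which is what lets you stitch the end-to-end request back together.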
Hopefully, by now, you can envision the start of your journey into DevOps. The most crucial thing in DevOps is not to fly blind. If you take away anything from this article, it's that you need to invest in observability now. Just do it.
Here at PrometheusKube, we believe we can make this easier for you. We build on the shoulders of giants to provide you with best-in-class Prometheus alerts, Grafana dashboards, and runbooks for the open-source software you run daily.