An Intro to Observability

In the Cloud Native world you often come across the word “observability”. What is it and why does it matter? Read on to learn more!

Background

In the Cloud Native world, you are typically talking about microservice-based architectures. Microservices are gaining in popularity because they let developers move faster. They do this by isolating a subset of functionality in a microservice that one or more other microservices can use. For example, most applications require authentication. Instead of a company with many applications implementing authentication separately for each one, all of them can share a single authentication service.

In practice, a microservice architecture is often simpler than that: a single application made up of multiple components. For example, it might have a UI, an API layer, an ingestion layer, a query layer, an analytics layer, a persistence layer, and so on. Each layer could be its own service, and some layers may even have multiple services. The ingestion layer may consist of a pre-processing service, a message bus service, an ingestion analytics service, and a persistence writer service.

The idea is that, with the code separated, a microservice only has to care about itself and the contract it exposes for others to consume. This means a microservice can be written in the language that best suits the requirements of the service. While this does provide agility, it also introduces some challenges. The one I will focus on in this post is observability.

What is Observability?

On the surface, observability is the Cloud Native world’s word for monitoring. A new word was invented because traditional monitoring tools are not sufficient in the Cloud Native era (more on this in a bit). Typically, when you hear about observability, you will also hear about the three pillars of observability. These are:

  • Metrics
  • Logs
  • Traces

So is observability just metrics, logs, and traces? No, because if it were, it would basically be traditional monitoring (except, most likely, for the traces). Instead, these data sources feed observability with the data that it needs. Personally, I am not a big fan of the three pillars of observability, for reasons including:

  • It makes it sound like observability is just about metrics, logs, and traces
  • It makes it sound like metrics, logs, and traces are all equals when it comes to observability

In reality, traces are the foundational data source in the Cloud Native era. This is because they provide context and correlation that metrics and logs lack. In fact, traces can be used to add context and correlation to metrics and logs (more on this in a future post).
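To make "adding context and correlation to logs" a little more concrete, here is a minimal sketch using only the Python standard library. It is an illustration rather than how any particular tracing tool does it: the `current_trace_id` variable, the `TraceIdFilter` class, the `checkout` logger name, and the `abc123` trace ID are all hypothetical names invented for this example.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical context variable holding the ID of the trace being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines are prefixed with the trace ID."""
    logger = logging.getLogger("checkout")  # assumed service name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id=None):
    """Emit two log lines that share a single trace ID."""
    token = current_trace_id.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.info("payment authorized")
    finally:
        current_trace_id.reset(token)


output = io.StringIO()
handle_request(build_logger(output), trace_id="abc123")
print(output.getvalue(), end="")
```

Because both lines carry the same `trace_id`, a log backend could group them with the trace they belong to instead of treating them as unrelated events.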

So what is observability? The short answer: it is the ability to measure symptoms (i.e., ask questions about a system) and explain them (i.e., get to the root cause) via open standards and open source data collection.

Why Observability is not Monitoring

Traditional monitoring allows you to measure symptoms; in fact, traditional monitoring tools ONLY allow you to measure symptoms. You are responsible for determining the cause through exploration and domain knowledge. For example, a metrics backend can tell you about high CPU usage on a service, but not why that high CPU usage is occurring. A metrics backend also cannot tell you the impact of that high CPU usage across requests throughout your system.
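As a rough illustration of how trace data can help with the "why", the sketch below takes the spans of one request and works out which service spent the most time in its own code. Everything here is assumed for the example: the `Span` fields, the service names (borrowed from the layers mentioned earlier), and the durations. It also assumes that child spans run one after another rather than concurrently, which real traces do not guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: float


def self_times(spans):
    """Return {span_id: duration minus time spent in direct children}."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.span_id: span.duration_ms - child_total.get(span.span_id, 0.0)
        for span in spans
    }


def slowest_service(spans):
    """Return (service, self_time_ms) for the span with the most self time."""
    own = self_times(spans)
    by_id = {span.span_id: span for span in spans}
    worst = max(own, key=own.get)
    return by_id[worst].service, own[worst]


# One made-up request flowing through four hypothetical services.
trace = [
    Span("a", None, "ui", 500.0),
    Span("b", "a", "api", 480.0),
    Span("c", "b", "query", 450.0),
    Span("d", "c", "persistence", 30.0),
]

print(slowest_service(trace))  # ('query', 420.0)
```

A metric alone would only show that the request took 500 ms. The parent and child relationships in the trace are what point at the query service.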

In addition, traditional monitoring often makes asking questions difficult. For example, metrics may be aggregated in 5-minute windows (which can hide a spike that lasts only a few seconds), traces may be heavily sampled, and vendors often price based on ingestion rates, which penalizes you for sending more data. These issues significantly impact observability, in part because you cannot ask the right questions. To ask the right questions and explain why something is happening, you need the complete picture from your data sources, along with context and correlation. While it may be necessary to sample data sources, the aggregates must be 100% accurate 100% of the time.
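One way to read "sample the data, but keep the aggregates accurate" is to compute counters over every request before any sampling decision is made. The sketch below is a deliberately contrived illustration of that idea, not a description of how any specific product works. The `ingest` function, the request dictionaries, and the keep-every-10th sampling rule are assumptions, and the data is constructed so that the sample happens to miss every error.

```python
def ingest(requests, sample_every=10):
    """Count every request, but keep only every Nth one as a sampled trace."""
    total = 0
    errors = 0
    kept = []
    for index, request in enumerate(requests):
        # Aggregates are updated BEFORE the sampling decision.
        total += 1
        if request["error"]:
            errors += 1
        # Simple deterministic sampling: keep every Nth request.
        if index % sample_every == 0:
            kept.append(request)
    return {"total": total, "errors": errors}, kept


# 1000 made-up requests; 40 of them fail.
requests = [{"id": i, "error": i % 25 == 3} for i in range(1000)]

aggregates, sampled = ingest(requests)
sampled_errors = sum(1 for request in sampled if request["error"])

print(aggregates)      # {'total': 1000, 'errors': 40}
print(len(sampled))    # 100
print(sampled_errors)  # 0
```

An error rate derived from the 100 sampled requests would be 0%, while the counters computed over all traffic show the real 4%. The sample is still useful for drilling into individual requests; it just should not be the source of the aggregates.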

Besides the limitations of traditional monitoring, people want vendor-neutral data collection so they can choose the backend or backends that fit their needs. Traditional monitoring was built primarily on proprietary agents. Note that when I say proprietary agents, I do not necessarily mean closed-source agents. I am also referring to open-source agents that only a single vendor uses. In the Cloud Native world, the infrastructure has been commoditized and open-sourced. The same is true for the collection of observability data. This is why, when you hear about observability tools, you hear about things like Jaeger, Zipkin, Prometheus, and ELK, not AppDynamics, Datadog, and Splunk.

Summary

Observability is the new term for monitoring in the Cloud Native era. It was created because traditional monitoring tools were not built to deal with issues in microservice-based architectures (as much as their marketing would have you believe otherwise). When you think about observability, you should think about open standards and open source data collection, with context and correlation provided by distributed tracing. You know you have achieved observability when you can ask arbitrary questions about your environment and explain why the behavior you see is happening.

© 2019 – 2021, Steve Flanders. All rights reserved.
