What is Observability All About?

Vanessa Martini
Principal Product Manager - Technical, Observability Analytics & UI

Automatic Summary

Understanding Observability: A Comprehensive Guide

Hello everyone! My name is Vanessa Martini, and I'm excited to share my insights about observability with you today. As an observability product manager at Red Hat, I focus on analytics and UI for OpenShift, and my goal is to introduce the concept of observability in under ten minutes. Let's dive into this fascinating world through relatable examples and key concepts.

What is Observability?

To grasp the essence of observability, it helps to visualize it in day-to-day scenarios. Imagine you wake up with a persistent headache. Initially, you might self-assess by checking for temperature or blood pressure anomalies. However, as the headache lingers, you consult a healthcare professional, undergo tests, and gather information from specialists to diagnose the underlying issue.

  • Self-Assessment: Measuring temperature or monitoring other symptoms.
  • Consulting Professionals: Engaging with a doctor and specialists for targeted diagnostics.
  • Diagnostic Testing: CT or MRI scans for comprehensive visibility into your health.

This gradual process of collecting and analyzing data mirrors the purpose of observability in software systems: to enhance understanding and enable effective decision-making.

Different Types of Observability

Observability can be categorized into several types, each serving specific roles within an organization:

  • System or Infrastructure Observability: Focused on app developers, platform engineers, and site reliability engineers.
  • Data Observability: Aiding data engineering and analytics teams to visualize data pipelines.
  • Machine Learning Observability: Assisting machine learning engineers and data scientists in monitoring model training performance.
  • LLM Observability: Newer practices ensuring large language models perform accurately.

This article specifically delves into System Observability, which is crucial for navigating complex cloud computing environments and microservice architectures.

The Pillars of Observability

Three fundamental pillars characterize observability:

  • Metrics: Numerical indicators that represent the system's performance, akin to vital signs.
  • Logs: Text records that capture events, warnings, and errors, providing essential context for troubleshooting.
  • Traces: Visualizations of requests as they traverse through different system components, highlighting potential bottlenecks.

Along with these primary pillars, signals from alerting systems, network events, and Kubernetes events play integral parts in developing a comprehensive observability strategy.
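To make the three pillars concrete, here is a minimal, self-contained sketch (plain Python, not a real observability SDK such as OpenTelemetry) in which a single hypothetical request emits all three signals: a counter metric, a structured log line, and a trace made of nested spans. All names (`handle_request`, `query_db`, `requests_total`) are illustrative assumptions, not part of any real API.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

metrics = {}   # metric name -> list of (timestamp, value) samples: a toy time series
spans = []     # completed spans, each a dict carrying timing plus a shared trace id

def record_metric(name, value):
    """Append one sample to a named metric series (the 'vital signs')."""
    metrics.setdefault(name, []).append((time.time(), value))

@contextmanager
def span(name, trace_id):
    """Time a unit of work and record it as a span belonging to one trace."""
    start = time.time()
    try:
        yield
    finally:
        spans.append({"name": name, "trace_id": trace_id,
                      "duration_ms": (time.time() - start) * 1000})

def handle_request():
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id):          # trace: root span
        record_metric("requests_total", 1)          # metric: a request counter
        log.info(json.dumps({"event": "request_received",   # log: context,
                             "trace_id": trace_id}))        # linked to the trace
        with span("query_db", trace_id):            # trace: child span (a bottleneck
            time.sleep(0.01)                        # here would show up in its duration)
    return trace_id

tid = handle_request()
```

Note how the log line carries the trace id: in real stacks this shared identifier is what lets you pivot from a spike in a metric to the logs and spans of the requests that caused it.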

How Observability Works

Effective observability is more than just data collection; it encompasses:

  • Storage: Keeping collected data organized for future analysis.
  • Delivery: Ensuring timely access to data as needed.
  • Analytics: Using analytical tools to discern patterns and insights.
  • Visualization: Displaying data in a way that's easy to understand and act upon.

This approach helps avoid data silos and fosters incident management through the correlation of various signals collected from the system.
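The correlation step can be sketched in a few lines. In this hypothetical example, logs and spans arrive from two separate backends (the "silos"), and a shared request id is the key that joins them during incident investigation. The sample data and the `correlate` helper are illustrative assumptions, not a real tool's API.

```python
# Hypothetical signals from two separate stores, sharing a request id.
logs = [
    {"request_id": "req-42", "level": "error", "msg": "payment declined"},
    {"request_id": "req-77", "level": "info", "msg": "payment accepted"},
]
spans = [
    {"request_id": "req-42", "name": "charge_card", "duration_ms": 950.0},
    {"request_id": "req-77", "name": "charge_card", "duration_ms": 40.0},
]

def correlate(request_id):
    """Join signals from both stores on the shared id -- the opposite of a silo."""
    return {
        "logs":  [l for l in logs  if l["request_id"] == request_id],
        "spans": [s for s in spans if s["request_id"] == request_id],
    }

# During an incident, one lookup gathers every signal for the failing request.
incident = correlate("req-42")
```

With the signals joined, an operator sees in one view that the failed request also took far longer than the healthy one, which is exactly the kind of cross-signal insight a siloed setup hides.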

Real-World Applications of Observability

Observability impacts many industries. Here are a few examples:

  • Telecommunications: Monitoring signal quality, detecting outages, and predicting equipment failures through AI.
  • Rail Systems: Real-time train tracking and onboard systems monitoring for operational efficiency.
  • Banking Services: Ensuring 24/7 availability and smooth digital transactions.

Incorporating an observability strategy can lead to increased operational efficiency, faster issue resolution, and enhanced customer satisfaction.

Challenges in Implementing Observability

When formulating an observability strategy, several challenges arise:

  • Cost Management: Training personnel and managing unpredictable licensing fees.
  • Choosing the Right Tools: Balancing between vendor solutions and open-source options.
  • Data Volume Management: Handling increasing volumes of data without losing critical insights.
  • Avoiding Data Silos: Ensuring that information flows freely across the organization.

Video Transcription

Hello, everyone. So welcome to the lightning talk, what is observability all about? My name is Vanessa Martini. I am an observability product manager working at Red Hat focusing on analytics and UI for OpenShift. And today, I have a mission, introducing to you all the concept of observability under ten minutes. So let's deep dive into this world starting with a couple of real life examples. First things first, I would like you to picture yourselves waking up with a headache, a headache that persists for days, affecting your sleep and energy levels. At first, you decide to self-assess the situation, for example, by understanding whether you have other pains. You might also measure your own temperature or blood pressure at home.

Basically, you detect that there is an actual issue and start collecting some metrics, but this may not be sufficient. You then go and talk to your primary care physician to assess recent lifestyle changes or potential traumas experienced. You may have one or more conversations and information exchanges with specialists to better understand the why behind this headache and what issue it could be. And especially if it is a long lasting headache, additional diagnostics might be needed. A CT scan or MRI scan might be conducted to better understand the health status of your brain, neck, or face, including cranial nerves, skull, blood flow, tooling that allows medical staff to locate the issue, if any. What is our goal by doing all this, you may wonder? Well, together with medical staff, we collect data to have better visibility into our body.

With better visibility, we are better equipped to analyze and discard irrelevant information, leading to a better understanding of our body. With a better understanding of our body, we are also able to diagnose issues faster, which leads to better health. And, again, a better understanding of our body and health leads us to start the most effective therapies and lifestyle adjustments, helping us collect additional information over time and, therefore, have even more visibility into our body. In other words, it's a never ending process of improvement and optimization. Another example could be vehicles, a car or truck. While driving, you feel that something is off, that there is an issue but cannot figure out why and where. The dashboard in front of you does not show any anomaly, but before the vehicle stops completely, you go to a mechanic to better understand what is going on.

So the mechanic will do some additional checks to identify where the issue is, as it might have been affecting other components of the car. So analyzing the collected data will allow them to reach a diagnosis, optimizing the vehicle, and finding a long term solution to the problem, making sure that it runs smoothly for you. So we have seen how the collection and analysis of different information allow us to connect the dots and have a clear snapshot of our health or our vehicle's health, being aware of initial symptoms, assessing the first alerts our body sends us, and investigating the issue with the support of specialists and targeted diagnostics.

So this is the purpose of observability for software systems, providing the ability to proactively do something and understanding the consequences if you decide not to. And with this context in mind, how can observability be best defined? There are different types of observability: system or infrastructure observability targeting app developers, platform engineers, site reliability engineers; data observability targeting data engineering and analytics teams, which allows for better visibility on data pipelines; machine learning observability targeting machine learning engineers and data scientists, allowing them to evaluate the performance of model training; and lastly, LLM observability, a new kid on the block, which allows teams to ensure that large language models perform accurately.

In this talk, we specifically focus on the first one, system observability. And observability is essential in complex cloud computing environments and highly distributed microservice architectures, which are composed of so many different components interacting with each other. Three pillars are the foundation of observability here: metrics, logs, and traces. On top of this, information such as alerts, network events, Kubernetes events also play a key role. Looking at these three pillars: metrics are numerical values representing the performance of the system, the so-called vital signs. Logs are text-based records of events, warnings, and errors, which provide the context we need for investigation. A trace is instead composed of a tree of spans which provide a view of a request as it flows through the system, pinpointing bottlenecks.

But observability is not just plain data collection. Storage, delivery, analytics, and visualization are all key for building a single pane of glass of our system. It's important not to fall into the trap of data silos, but instead correlate the different signals our system is collecting to facilitate incident management and apply AIOps to enable proactive issue resolution. In other words, observability is a property that allows us to extract meaningful insights from our system. Observability triggers a continuous feedback loop that no other tooling can provide us with. How? So by collecting data from all layers of our stack, metrics, logs, traces, and events, we get better visibility into our system. By turning this data into actionable insights, we can have a better understanding of our system. And with a better understanding, we can build better systems, more reliable, more performant.

And with better systems, we can also instrument better and collect even more granular information out of our systems, which again leads us to better visibility into it. So we are more and more empowered to detect issues earlier, resolve them faster, and make our system highly reliable. So even here, it's a never ending cycle of optimization as our system evolves, exactly what we saw earlier when talking about our own health, if you remember that graph. And, again, the impact of an effective observability strategy is all around us. Let's take the telecommunications industry as an example. Telecom towers support mobile and wireless communication by hosting antennas, base stations, supporting equipment. So observability for those may include power supply checks, signal quality, connection failures, outage detection, but also AI powered anomaly detection to predict equipment failures.

Another example is the importance of observability in rail systems, which enables real time train tracking, but also onboard systems monitoring, also to have access to control centers, dashboards to assess infrastructure, and train health. And last example here, banking services. So observability can be crucial to provide twenty four seven availability and real time transactions, but also to provide access to digital banking services to our apps on our phone, as you can see from this picture. So we have seen that the importance of observability in our day to day life is tangible. From an organizational perspective, an effective observability strategy can increase operational efficiency by providing tools for faster issue resolution, which in turn increases customer satisfaction, which again in turn improves the overall organizational reputation. But where to begin, you may wonder. Open source could be a great starting point here.

So here are all the open source projects and integrated ones covering all aspects of observability. This overview is provided by the Cloud Native Computing Foundation. There are so many options you can pick from, from vendor solutions to fully open source ones, from graduated open source projects to sandbox ones, which are highly innovative and backed up by growing communities. And you can use these projects as building blocks for your own use case, and there is no right or wrong. For picking the right tools, you need to weigh your priorities including budget. And this leads me to this section. So there are, in fact, challenges to account for when formulating an observability strategy.

The importance of managing costs, so training dedicated personnel or dealing with unpredictable licensing costs; choosing the right approach for your needs, so relying on vendors or building with open source, there are pros and cons related to both. And as systems become more and more complex, effective observability means handling an increasing data volume, leaving out the noise, avoiding data silos, and really making use of the right analytical tools. So at Red Hat, open source is the answer, our way forward. That is our OpenShift observability solution today. We have both a single cluster and multi cluster observability stack in place as part of OpenShift, OpenShift being a container platform. This OpenShift observability solution is powered by projects such as Prometheus, Vector, and OpenTelemetry for data collection, Thanos, Loki, and Tempo for storage, Kepler, Korrel8r, and Kruize for analytics, Perses for visualization, and many more, as you can see from this slide.

So we have seen how effective observability allows us to optimize systems and provide a seamless user experience. It's a journey of improvement that has become irreplaceable in today's competitive market, especially coupled with AI. So I hope I've raised some curiosity around the topic. Feel free to reach out to connect in case you have any questions. Thank you for joining me today. I hope you have a great day. So thank you all.