Observability in today's complex systems by Shweta Sharma

Automatic Summary

Understanding Observability in Complex Systems: The Key to Robust Software Applications

Hello there, I'm Shweta, a Senior Software Engineer at Intuit. Today, we'll delve into the concept of observability in complex systems, with a sprinkle of references from the famous movie The Matrix. As we navigate the digital world, understanding the inherent complexities of a system becomes critical. Let's explore how observability helps us do just that.

The Role of Observability in Complex Systems

Remember Neo from the Matrix? He could see what was happening in the system, read the metrics, and predict the system's action. But how was he able to do that? The architect and Neo were able to see through the metrics because they observed them closely. This allowed them to identify patterns and anomalies.

Understanding System Architecture

Systems can be broadly categorized into monolithic and microservices architectures. A monolithic architecture packages the whole application, from the client code to the database, as one unified project. A microservices architecture, on the other hand, breaks the application down into smaller independent units, allowing the system to grow as business and user requirements expand.

Observability and Microservices

Why is observability crucial when we talk about microservices? With many microservices communicating with each other in a complex system, the possible failure modes multiply, and when something fails it is no longer obvious what caused it. Observability helps you find the cause of failures, track performance issues, identify which API contributes to latency, and spot unusual system behavior.

The Pillars of Observability

Making your system observable requires three major pillars:

  • Logging: It provides critical information about the processes running in a system.
  • Metrics: Quantitative information about the system, often represented as counts or measures, and usually aggregated over a period of time.
  • Traces: Records of individual transactions or requests as they flow through a system, offering vital context for telemetry data.

Tools to Implement Observability

To harness the power of observability, various tools can be used, such as OpenTelemetry for tracing, Micrometer for metrics, and Lightstep for visualization. Remember, successful observability is about understanding and observing your system to find anomalies or changes in behavior and to detect those issues at the right time.

Observability vs. Monitoring

Observability uses instrumentation to provide insights that aid monitoring. Monitoring covers the knowns: the metrics, alerts, and dashboards you already know you need. Observability comes into play when there are a lot of unknowns. Simply put, monitoring is what you do after a system is observable.
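As a rough illustration of that difference, the sketch below contrasts a fixed alert threshold (a known failure mode) with a check against the service's own recent baseline, which can surface a change nobody anticipated. It is a minimal Python sketch using only the standard library. The function names, the threshold of 50, and the sample history are hypothetical, and the "three standard deviations" rule is an assumed heuristic rather than something stated in the talk. The numbers echo the talk's example of a service that usually handles about 22 calls per second.

```python
from statistics import mean, stdev


def static_alert(value, threshold):
    """Monitoring a known problem: fire only when a preset limit is exceeded."""
    return value > threshold


def deviates_from_baseline(history, value, tolerance=3.0):
    """Flag a value that is far from what this service normally does.

    'Far' is assumed to mean more than `tolerance` standard deviations
    away from the mean of the recent history.
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > tolerance * spread


# Hypothetical recent calls-per-second readings for one service.
history = [21, 22, 23, 22, 21, 23, 22]

# A sudden drop to 10 calls/s never crosses the static limit of 50,
# but it is clearly outside the usual pattern.
drop_caught_by_static = static_alert(10, threshold=50)
drop_caught_by_baseline = deviates_from_baseline(history, 10)
spike_caught_by_baseline = deviates_from_baseline(history, 100)
normal_flagged = deviates_from_baseline(history, 22)
```

The static check only knows about the failure someone thought of in advance (too much traffic); the baseline check reacts to any break in the usual pattern, including the unexpected drop.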

In conclusion, whether you're engineering a new software application or enhancing an existing one, understanding observability will be instrumental in creating a robust and efficient system. And remember, like Neo, you too can see through the complexities in the system if you observe closely. Connect with me on LinkedIn for any queries or further discussions on this subject.

Thank you, stay observant, and keep decoding those systems!


Video Transcription

This is Shweta, and we will be talking about observability in today's complex systems. But before that, I would like to say hello to all of you. I am Shweta, a senior software engineer at Intuit. I am also the mother of these two little monsters, and of course I love to dance and do yoga, and walking is something which keeps me sane with all this going on, working from home and the pandemic and things like that. Moving forward, the agenda for today is very simple: we want to talk about how observability helps us get answers to all these Ws. I watched the movie The Matrix very recently. I love the craft the creative team has put into it. And yes, I'm talking about the one which was released in 1999. I will be using a lot of references from the movie today, and we'll refresh your memories. To start with, if you can recollect, Neo was the one who could see what was happening in the system, and he was able to look at the metrics and predict the actions of the system. How was he able to do that? How did he know all the fighting patterns of the agents? I know there was software they used to install to learn the art immediately, but he specifically knew their actions and was able to fight much more effectively than his teammates.

And do you remember when he meets the Architect? The Architect tells him that he is the eventuality of an anomaly which he, the Architect, could not eliminate. So what is an anomaly? Your system or application has a pattern for doing a certain task, and that pattern is collected or generated as metrics, right? An anomaly is what breaks that pattern for the same data, the same trend, or the same use cases. It could be because of the day of the week, the time of the day, a program that is running, or something breaking down. It can be anything; we don't know. But the question is how Neo and the Architect were able to see through these metrics. That is because both of them observed these metrics really closely. One was able to identify the pattern and the other was able to identify the anomaly. The highlight here is that the Architect knew about Neo, but he could not fix the anomaly. So what is the learning here? One: however hard you try to make a perfect application, there will be some failures, which shows that no system is perfect. Second: it is hard to fix even a known bug or issue in a complex and old system. And that is our focus today. I'll briefly discuss this. We have two different types of architecture. One is monolithic. A monolithic architecture is where we have the whole application as a single individual unit.

So the solution or application, the client code, the user UI, the business logic, the back-end database, everything is one unified project, and everything is managed in one place. A microservice architecture is one where you have broken down the work of your application into smaller independent units.

Earlier, when we had to make a change in our code while working on the monolith, we would take the whole code base and make the change. But with microservices, you follow the single responsibility pattern, where you say: OK, this is a service for managing my products. This is a service for managing my inventory. This is a service for managing my sales. This is a service for managing my accounts, and so on. So when you have such a system, your application grows as your business grows and as your user requirements grow. And that is the benefit of moving to the distributed, or microservice, architecture. If you see here, we have one layer for one responsibility: this could be my product service, this my inventory service, and so on.

Now, why is observability required when we talk about microservices? Everything is turning open source, microservices are running on Kubernetes clusters, we are using CI/CD pipelines, DevOps, and agile, and software is being developed at the speed of light. With such a complicated distributed system, where so many microservices are communicating with each other, the possible failure modes are also multiplying. When something fails, it is no longer obvious what caused it, just as the Architect could not figure out, in his complex system, where this anomaly, Neo, came from and why it was happening. Observability helps you find these answers. Which services did a request go through? What performance issues are we facing? Which API is contributing to latency? Where is the system taking too long to respond? How are the requests different from usual, or how is your system behaving in an unexpected way? Which API is actually failing? How did each microservice handle or process the request? Were there any failures? What did the success metrics look like? What time did it fail? And so on. To get these answers, your system should be observable. Moving on: what is observability? Observability helps developers understand multi-layered architectures. It is about understanding and observing your system to find any kind of anomaly or change in behavior, detecting it at the right time, and taking the right action when it happens. So observability can help you monitor the behavior, the capacity, the performance, and the metrics.

But how will you do that? We need three major pillars to make our system observable: logs, metrics, and traces. And when you have these pillars, you need visualization, because if the data cannot be seen in an easy-to-read form, it cannot tell you exactly what it is trying to say. It doesn't really convey the meaning, and you will spend more time figuring that out. Visualization gives readability to the data produced by all these pillars. Just an example again: Neo sees a black cat walk by, followed by another one that does the same thing. He says "déjà vu", right, which is actually the repeat occurrence of an event that has already taken place. So basically they used this change in pattern to figure out what was coming next. And that is what logs, metrics, and traces give you: the change in pattern. I'm going to explain each of these pillars one by one, and I will also talk about the tools you can use to implement them. OK, metrics. A metric is a value that expresses some data about a system. Metrics are usually represented as counts or measures and are often aggregated or calculated over a period of time. We also call them time series.
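To make "counts aggregated over a period of time" concrete, here is a minimal Python sketch that turns raw event timestamps into a per-window count, which is the simplest form of a time series. The function name, the 60-second window, and the sample timestamps are all hypothetical; a real metrics library would do this bookkeeping for you.

```python
from collections import Counter


def count_per_window(event_times, window_seconds=60):
    """Aggregate raw event timestamps (in seconds) into counts per fixed window."""
    counts = Counter()
    for t in event_times:
        # Round each timestamp down to the start of its window.
        window_start = int(t // window_seconds) * window_seconds
        counts[window_start] += 1
    # Sorted (window_start, count) pairs form a simple time series.
    return sorted(counts.items())


# Hypothetical request arrival times, in seconds since some starting point.
request_times = [1, 5, 42, 61, 75, 119, 130]
series = count_per_window(request_times, window_seconds=60)
print(series)
```

Each pair in `series` is one data point: the start of a window and how many requests fell inside it. Plotting those points over time is what a dashboard graph of request volume shows.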

For the types of metrics, I mentioned USE and RED metrics. You can call them operational or functional metrics, infra or application metrics, system or application metrics; people call them different names. But basically, one part is where we talk about utilization, saturation, and errors, which is USE, and the other part, which is more focused on the application, is rate, errors, and duration, which is RED. USE metrics are system-level metrics. They give you info such as how much memory is being used by a process out of the total memory, how much is available, how your CPU is performing, how many DB connections you have, and how each of your pods is doing. So this covers the whole system, whereas RED metrics give you the details about the application: the number of requests per second coming in or being handled by the service, the number of errors coming through, what the FCIs look like (FCIs are failed customer interactions), and what the latency and throughput are. So metrics provide quantitative information about the processes running inside the system, and we usually use dashboards to show these metrics in a readable format. But how is this helpful? I mentioned FCIs, failed customer interactions.
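A minimal sketch of how the three RED numbers could be derived from a batch of request records, in plain Python. The `Request` type, the `red_metrics` function, and the sample data are hypothetical, and counting HTTP 5xx responses as errors is an assumption made for the example; a team may define a failed customer interaction differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code returned to the caller


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration for requests seen in one time window."""
    total = len(requests)
    # Assumption: a 5xx status counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "avg_duration_ms": mean(r.duration_ms for r in requests) if total else 0.0,
    }


# Hypothetical requests observed during a 2-second window.
sample_requests = [
    Request(100, 200),
    Request(200, 200),
    Request(300, 500),
    Request(400, 200),
]
summary = red_metrics(sample_requests, window_seconds=2)
print(summary)
```

An average hides slow outliers, so dashboards often show percentiles for duration as well; the average is used here only to keep the sketch short.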

So let's say your customer experience is broken. Application metrics will tell you which API or service contributes to that, and why it is happening: because there is a drop in your database connections, which is reflected in your system metrics. Now you know, oh, database connections are dropping; is the database up or not? So you go and check whether the database is up. You see, the customer was facing a problem, but you could see what was wrong in the application, and that connected the dots to the infra: OK, I'm not able to connect to the DB. And that connected the dots to the underlying infrastructure: OK, the DB itself is not up. It is this correlation which provides the complete picture through these metrics. And even if another team is managing your infra, you can still get this information. Now, how do you get this picture? There are multiple tools, and people use different things, but I'm going to talk briefly about Micrometer. Because Java is such a widely used language, Spring Boot provides a way through which you can get system-level or application-level metrics without creating a dependency on any particular monitoring system. You can select one or several monitoring systems here.

And then you will be able to export your metrics data to that monitoring system or to a visualization system like Wavefront or anything else. The Spring Boot Actuator exposes the underlying metrics, and Micrometer provides a facade that can be used to either push or pull metrics to the monitoring system. It could be Prometheus; it could be Wavefront, as I said. I'm going to recommend Prometheus because, again, it is open source, but you can use any other system. Prometheus periodically pulls data via HTTP; it is an open-source monitoring system. And then you can use Wavefront or Grafana to view these metrics as graphs. Spring Boot provides you a lot of metrics out of the box. I'm just showing here which metrics are available out of the box for you, like HTTP server requests, client requests, JVM memory, garbage collection, and thread count. These are all very basic metrics we need in our application. You also get the Kafka and Hikari metrics out of the box. You don't have to do much coding there.
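The facade idea can be sketched in a few lines. This is not Micrometer's API (Micrometer is a Java library); it is a hypothetical Python illustration of the same design, in which application code records a metric once and the backends decide whether the value is pushed out immediately or held until a monitoring system pulls it. Every class and metric name here is made up for the example.

```python
class PullBackend:
    """Holds current values in memory until the monitoring system asks for them."""

    def __init__(self):
        self.values = {}

    def record(self, name, value):
        self.values[name] = value

    def scrape(self):
        # What a pull-based system would read on its periodic visit.
        return dict(self.values)


class PushBackend:
    """Forwards every update to a sink right away (here, a plain list)."""

    def __init__(self, sink):
        self.sink = sink

    def record(self, name, value):
        self.sink.append((name, value))


class MetricsFacade:
    """Application code talks only to this class, never to a backend directly."""

    def __init__(self, *backends):
        self.backends = backends
        self.counters = {}

    def increment(self, name, amount=1):
        self.counters[name] = self.counters.get(name, 0) + amount
        for backend in self.backends:
            backend.record(name, self.counters[name])


pull = PullBackend()
pushed = []
metrics = MetricsFacade(pull, PushBackend(pushed))
metrics.increment("orders.created")
metrics.increment("orders.created")
```

Because the application only ever calls `metrics.increment(...)`, swapping or adding a monitoring system means changing the list of backends, not the business code. That is the "no dependency on any particular monitoring system" point made in the talk.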

You can also do application-level metrics. You can use timers, error counts, and gauges, which give you the transactions per second (TPS), latency, and traffic information, or even error information, for your application-level metrics. So metrics basically tell you where the problem is, right? Moving on. Now we have to find out what the problem is, so you need more details, right? And that's where logs come in. Logs are structured or unstructured lines of text that are emitted by an application in response to some event in the code. These logs are the record of each and every event that happened on a system. Some logs are auto-generated and some are created by us, but we need to have them in a structured format. I'm so sorry I'm going so fast, but I want to cover some important topics in this time. Now, when we say structured and unstructured logs: if you log something like "error logged on this date", it does not give you any of the detail you need. When did this error happen? What specifically was the error? Where did it happen? Whom did it fail for? Was it a user, the application, the client? With structured logs, if tomorrow I want to see when or where this error happened, I can just search by user between this date and that date and see how many users were impacted by any type of error.

And then I can focus specifically on why it happened or what the error was. Was it the same error for every user? Was it a different error? How many applications were impacted? I'm logging the API here, so how many APIs were impacted, and so on. The key here is that your error logs have to be very structured and should provide detailed knowledge of the specific system, letting you know exactly what happened, where it happened, and why it happened. That is the kind of detail you need to make your system observable.
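A small sketch of that "search by user between two dates" idea, assuming each log line is emitted as one JSON object. The field names (`timestamp`, `user_id`, `api`, `error`), the helper functions, and the sample events are all hypothetical; the point is only that named fields make the questions of what, where, when, and for whom directly answerable.

```python
import json
from datetime import datetime


def log_event(level, message, **fields):
    """Build one structured log line: a JSON object with named fields."""
    record = {"level": level, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


# Hypothetical log lines emitted by an application.
log_lines = [
    log_event("ERROR", "payment failed", timestamp="2022-03-01T10:15:00",
              user_id="u1", api="/payments", error="CardDeclined"),
    log_event("ERROR", "payment failed", timestamp="2022-03-02T09:00:00",
              user_id="u2", api="/payments", error="Timeout"),
    log_event("INFO", "profile loaded", timestamp="2022-03-02T09:05:00",
              user_id="u3", api="/profile"),
    log_event("ERROR", "profile failed", timestamp="2022-03-09T12:00:00",
              user_id="u1", api="/profile", error="Timeout"),
]


def impacted_users(lines, start, end):
    """Return the users who hit any ERROR between start and end (inclusive)."""
    users = set()
    for line in lines:
        record = json.loads(line)
        when = datetime.fromisoformat(record["timestamp"])
        if record["level"] == "ERROR" and start <= when <= end:
            users.add(record["user_id"])
    return users


users = impacted_users(log_lines, datetime(2022, 3, 1), datetime(2022, 3, 3))
```

With an unstructured line such as "error logged on this date", the same question would need fragile text parsing. Here it is a filter on fields, and the same records can be grouped by `api` or `error` to answer the follow-up questions.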

But there is a new component which we add to observability: traces. Logs can give you the details about a service error. But remember, we are working with a distributed architecture, where different APIs are calling each other. Let's say the latency of our API has increased, but we do not find an issue in our logs. My application is good, but the latency has increased. What should I do now? How about we add a request ID, a unique ID, in all the different applications and microservices, which will link our call to all the different services it touches? Let's say you're creating a user. When I'm creating a user, the user gets created, then I need to ask for the account information, then for the payment information, and so on. Where did the application fail? So what is a trace? This is a trace. It is telling me that I started working from the entry point here, and then I called many different services: this is one service, this is another service, and then this is another service. And this whole thing will tell me how much time I took from start to finish.
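The request ID idea can be sketched as follows, in plain Python with three hypothetical "services" written as functions that share one in-memory log. In a real distributed system the ID would travel between services, typically in a request header, and each service would write to its own log; those details are left out here, and all names are made up for the example.

```python
import uuid


def handle_create_user(request_id, log):
    """Entry point: creates the user, then calls the downstream services."""
    log.append({"request_id": request_id, "service": "user", "event": "user created"})
    fetch_account(request_id, log)
    fetch_payment(request_id, log)


def fetch_account(request_id, log):
    log.append({"request_id": request_id, "service": "account", "event": "account loaded"})


def fetch_payment(request_id, log):
    # Hypothetical failure, to show where the flow broke for this request.
    log.append({"request_id": request_id, "service": "payment", "event": "payment lookup failed"})


def events_for(request_id, log):
    """Reconstruct one request's path through the services from the shared log."""
    return [(e["service"], e["event"]) for e in log if e["request_id"] == request_id]


shared_log = []
first_id = str(uuid.uuid4())
second_id = str(uuid.uuid4())
handle_create_user(first_id, shared_log)
handle_create_user(second_id, shared_log)

path = events_for(first_id, shared_log)
```

Even though both requests write to the same log, filtering on the ID recovers the path of a single request and shows that it got as far as the payment service before failing.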

If I'm a shopkeeper managing my inventory, I will call my product service and I will call my inventory service. Inside the inventory service, I will be doing some kind of save, update, and so on. And then I will let the user know: OK, my product is created. So these are the spans in this one big trace, and the spans tell you: this is one span, where I called this service; this is another span, where I called another service. Within this one I have multiple spans, which say: this was a very critical piece of work which I had to do, and I have completed it. But I spent 210 milliseconds out of 230 milliseconds in this particular chunk of work, which can help me look into why this particular method, service, or component is taking so long, and I can log some more spans to figure that out. OK, moving on, if you are still with me and can continue for a few more minutes. That is what I wanted to say about traces. Now, to get the traces, I'm going to talk about some systems. A single trace shows you the activity for an individual transaction or request as it flows through an application. Traces are critical for observability, as they provide the context for the other telemetry. Traces can help you determine which metrics would be most valuable in a given situation and which logs are relevant for a particular issue.
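Here is a minimal sketch of a trace as a tree of spans, using the shopkeeper example and the talk's timing of 210 ms spent inside a 230 ms request. It is plain Python, not the OpenTelemetry API; the `Span` class, the `slowest_leaf` helper, the span names, and the individual start and end times are hypothetical values chosen to reproduce that ratio.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def slowest_leaf(span):
    """Return the innermost unit of work (a span with no children) that took longest."""
    if not span.children:
        return span
    return max((slowest_leaf(child) for child in span.children),
               key=lambda s: s.duration_ms)


# One trace: the root span covers the whole request, children cover each call.
trace = Span("create-product", 0, 230, [
    Span("product-service", 5, 15),
    Span("inventory-service", 15, 228, [
        Span("save-inventory", 16, 226),
    ]),
    Span("notify-user", 228, 230),
])

slow = slowest_leaf(trace)
print(f"{slow.name}: {slow.duration_ms} ms of {trace.duration_ms} ms")
```

The root span gives the start-to-finish time, and walking down the tree points at the chunk of work that dominates it. If that span is still too coarse, adding child spans inside it narrows the search further.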

For example, earlier I was showing how much time this was taking. Now you can go and check the metrics for this time duration, or for this particular API, to know what is behaving unexpectedly here. OK. To get these traces, excuse me, we can use different systems. One is OpenTelemetry, which again is open source. With OpenTelemetry, you will be able to find out which API failed, and you can also find out which system or service contributed to the latency. Defining global attributes, such as the environment name or the request ID from the example I gave, will give you that data for each of the segments, or the different resources, of your API. You can also add custom attributes, like a product ID, a user ID, or some other ID if you want, which will help you track the spans easily. And there are other attributes that you can have. I am showing an example of Lightstep in this UI; it is a visualization tool for OpenTelemetry, and Lightstep has also contributed to the OpenTelemetry open source project. Jaeger also provides a minimal UI, but Lightstep is a paid tool.

I do find it good, because if you see here, my service, get profile for user, was being called and it failed. It is showing me the error, in red here, for whatever services have failed. Even take a step back before that: oh, this one API call is communicating and working with so many different systems. It is also showing me the different database queries. The yellow color shows how much latency each one contributed, and then, for the selected service, how much latency it finally comes to.

So you get to know all these different details from OpenTelemetry or the Jaeger system. OK. People usually get confused between observability and monitoring. I will just wind up my session after this. What exactly are observability and monitoring? Observability uses instrumentation to provide insights that aid monitoring.

You can say monitoring is what you do after a system is observable. Monitoring is for the errors and for understanding the system; without some level of observability, monitoring is not possible. So it is a part of observability, and that is why you need metrics to have this monitoring in place. Observability is for when you have lots of unknowns, right? Monitoring is when you know that there is or will be a problem but are not sure what to do. I am very much over time, and maybe I will have another session to go in depth on other things. But just winding up with this: monitoring is basically, as I said, for the knowns, where you know: I need these metrics, I need these alerts, I need this dashboard. But what about the things which you don't know, like Neo, who nonetheless exists? You need the traces, which provide a trend, the spans and traces for your app, and tell you: OK, usually this takes only 22 calls per second. Now it is saying 10, or 100; that means something is going to go wrong. You are projecting; nothing has gone wrong today. And that is what observability is. Thank you, everyone. Thanks for your time. I do have some more things to add, but I appreciate your time.

I hope you found this session helpful, and if you have any questions, you can always ping me on LinkedIn and connect with me, and I will be happy to reply. Thank you.