The Monitoring Experts Podcast

Monitoring insights, deep dives, use cases, and best practices.
Since 11/2022 · 6 episodes

Observability and monitoring

What observability is, how it developed, and observability vs. monitoring

2022-12-14 · 25 min

Description & Show Notes

What is observability? And how does it relate to "traditional" monitoring? We take a look at how observability has taken us from "Neanderthals with log files" to "astronauts with instrument panels", and also how it has turned the traditional concept of monitoring upside down.
Greg Campion is the monitoring expert who joins Shaun to discuss observability in this episode. 

According to Greg, observability is observing a system in the sense that you take in as much of it as you can, without necessarily looking at something specific. For him, it's about the ability to observe things from a much wider perspective.

Here's what Greg discusses: 
  • The development of observability from simple logs through to the single pane of glass (with this article as our reference point)
  • The challenges of incorporating observability into existing systems
  • The difference between observability and monitoring
  • Why Greg is so fascinated by observability
Greg also wrote a blog post in 2019 called The rise of observability that is worth a read. He also covered observability in a video about the future of infrastructure: Why we won't care about infrastructure anymore.

Things mentioned in this episode:
Give us feedback

Get in touch on Twitter: @monitoringxpert
Website: themonitoringexperts.com
Visit the Paessler blog for more from the monitoring experts. 

Transcript

Welcome to The Monitoring Experts by Paessler, the podcast where we discuss all things monitoring: deep dives, best practices, expert opinions and stories from the front lines. Let's get monitoring. Let's get monitoring, indeed. It's the first episode of The Monitoring Experts podcast. And if you have any feedback for us, or if you want to get in touch, or if you have a monitoring question that you'd like us to answer in the future, or just a suggestion for a monitoring topic that we can cover on this podcast, let me know; all the links are in the show notes below. In this first episode, we're going to be covering a topic that has come up in conjunction with monitoring more and more in recent years, and that's observability. I'm going to be joined by a monitoring expert who has some experience with observability, but who has also spent a lot of time in the IT and software development worlds. His name is Greg Campion. We're going to be talking about, first of all, what exactly observability is; then we're going to take a high-level look at how it's developed over the years, and also how to build observability into processes; and finally, the difference between observability and monitoring. So let's get into it. Here's Greg. So joining me on the podcast this week is Greg Campion. Welcome to the podcast. Thanks. Yeah. So before we even start with anything, maybe just give me the background of who you are and what you do here. Yeah, so I'm Greg Campion. I started at Paessler in 2013 when I moved to Germany, and I was here in Germany up until last month, and then I moved back to California. I went from the support team to the IT team to the development team, and now I'm in the presales team. All right, so you moved around a bit. But I just wanted to tell you that you annoy me, because I got a text from you after you moved back to California and you're like, Oh, by the way, I had an idea on my way to the beach.
If anyone knows Nürnberg, there are no beaches here. And it's frigging autumn already, or fall, as Americans would say. There's no way we're going to the beach at this time, and there you are, off in California. It's like, Oh, by the way, I do my best thinking at the beach. Yeah, I mean, I actually was driving to the beach. What a pleasure. And now you're back in Germany for a visit? Yes. Okay. Just ping-ponging. Ping-ponging the globe. Yeah. Greg, what we're here to do is talk about observability; that's the topic for today. One of my favorites. But maybe let's start with: how do you see observability? Yeah, I mean, observability is kind of a term that I'd never really heard that much until I went to a big monitoring conference, Monitorama, which is a fantastic conference, by the way. And a lot of people were talking about observability, and these are like the most hardcore guys in the world who monitor stuff. I mean, the event's called Monitorama, not Observabilityrama or something. So I'm like, why is everybody talking about it? I mean, obviously when you hear the word and you know what observability means, you immediately kind of associate it. It's like, okay, you want to see something. Exactly. And the idea behind it is essentially that instead of watching a system, you observe the system, in the sense that you take all of it in, or as much of it in as you can, and maybe you aren't exactly looking at something specific at the time, but you are taking in basically all this data that you can then later observe to be able to come to conclusions. Okay. And so it has something to do with monitoring, but they kind of wanted to rebrand it, you know, because we've always been talking about monitoring, monitoring, monitoring. But, you know, monitoring kind of implies this dude looking at a monitor or something, right? Like that's kind of what's implicit within that term.
And so the idea of observability, I think, personally for me, what's behind that term is just the ability to observe things from a much wider perspective. Okay. So yeah, I have a lot of questions about that, but I think we're going to get to them. Let's just take a look at the history of observability first. We're not going to go too deeply into this; we're just going to kind of gloss over the big steps, as part of maybe helping someone to understand what observability is, because then we're going to get back to your definition that you just gave. There is this article that you sent me. It's on a website called Open Core Ventures dot com. And the reason you sent it to me is that it explains quite well how observability progressed. The article is called The Rise of Observability: From Neanderthals with logs to astronauts with instrument panels. And what they're saying in this article is that it started with logs. So that was the very, very beginning, where logs were collecting all kinds of information. Yeah. The idea is that, originally, and not just originally, I mean, with every single piece of software ever written, the first thing you build in is logs, right? Right. It's a necessary component to any application ever. And so that's what they started with. And, you know, if something goes wrong, then you go read through the logs. But as you can imagine, the more complex things get, the more logs and more logs there are, and then you're trying to find specific things. And then you're talking about, you know, tens of thousands or millions of requests. You can't read every single one of those. You know, it becomes a thing of searching through the logs when there's an issue or you're trying to find something out, I guess. And I mean, a lot of traditional kind of monitoring used to be about parsing logs, you know, getting intelligible data out of those logs, and so on and so forth. I mean, logs are still the de facto standard.
I mean, if you want to get to the source of truth, you look at that application's logs. But trying to sort through them can be impossible. So that's the beginning of the Neanderthals with logs, because you've just got all these logs and not necessarily ways to make sense of them. The next step was then, according to this article, post-Y2K: metrics with Nagios. I mean, that was the first piece of software that I ever used for monitoring. At my first, like, real sysadmin job, we monitored everything with Nagios, and one of the admins there who started setting it up was a hardcore Linux dude, and he just really liked Nagios and he showed it to me, and I was like, Wow, this is really cool, because you just collect metrics, you start pulling data from these things. We had a ton of web servers or whatever, and then you, you know, set your thresholds and then you get information back. So instead of, Oh God, something's wrong, you know, users are having a shitty experience, go take a look at the logs, it was proactive, in the sense of, Oh hey, that one server's hard disk is full, give me an alert. And so that's where the whole metrics and thresholds idea really came into play. I mean, obviously Nagios is still used everywhere. Yes. For the same thing. Metrics are still a very, very important part of monitoring or observability, however you want to look at it. And this is what the article refers to as pull-based metrics, as well as reusable checks. Yeah, exactly. That was the stuff that came along with that. And that's what it was: you could repeat it. You had, you know, your templates for specific devices or whatever, and you could just throw them on. You know, as soon as you add something new to the network, you can just automatically apply the template to it.
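To make the threshold idea Greg describes concrete, here's a minimal sketch in Python. Real Nagios plugins are standalone scripts (often shell or Perl) that follow this same exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the hostnames and disk numbers below are invented for illustration, and a real check would query the host rather than take a hardcoded value.

```python
# Nagios-style "pull" check, sketched: the monitoring server runs a small
# check against each host, compares the metric to warning/critical
# thresholds, and maps the result to a standard exit code.
OK, WARNING, CRITICAL = 0, 1, 2

def check_disk(usage_percent, warn=80, crit=90):
    """Classic threshold check: returns (exit_code, status_line)."""
    if usage_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {usage_percent}% used"
    if usage_percent >= warn:
        return WARNING, f"DISK WARNING - {usage_percent}% used"
    return OK, f"DISK OK - {usage_percent}% used"

# The "reusable check" idea: the same template applied to every server.
servers = {"web-01": 42, "web-02": 85, "db-01": 93}
for host, usage in servers.items():
    code, line = check_disk(usage)
    print(f"{host}: {line} (exit {code})")
```

The point of the convention is that any check, for any metric, reports health the same way, which is exactly what made checks reusable as templates.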
And you knew, okay, I'm monitoring that device for the critical things that I think may or may not fail, you know, the things that I should be looking at. I mean, that sounds pretty much like what we think of as monitoring nowadays, or part of it, right? Yeah, absolutely. I mean, the push revolution was something that came about later on. Okay. Right. The push revolution was that instead of, I'm sitting here as Nagios and I'm going out and pulling this data from you, the devices themselves start sending data to you. Right. So there was kind of this back and forth between push and pull. And there was even kind of a debate: some proponents were like, no, push is the best method, and some said pull is the best. Yeah, the great controversy tearing apart the monitoring world. Yeah, it was. I mean, I remember talking to my brother-in-law, who is a monitoring expert, and he's been doing that for 15 years or whatever, and that time period is exactly when that was happening. It was like, you know, should we do push or should we do pull? But it was all about metrics, right? Okay. And then there was something else that they mentioned. I thought it was a step on its own. No, it was part of that, actually: the golden signals, which is what they call the things that correlated most to app health and user experience. Yeah, you had like the RED stuff. So you had the request rate, you know, how many are actually coming in; then you have the amount of errors; and then you have how long requests are taking, right? So standard kind of web service world. You know, if I have 10,000 people coming to visit my site and then all of a sudden, you know, your request duration starts going up, well, then you might have a database server that's starting to act up. Or if you start getting an error rate, then maybe somebody pushed some crap code into production and now all of a sudden it's not working anymore.
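The three RED signals (Rate, Errors, Duration) can be sketched in a few lines of Python. The request records and the one-minute window below are invented sample data; in a real system these numbers come from instrumented request handlers or a metrics pipeline, not a hardcoded list.

```python
# RED signals from a batch of request records: rate (how many are coming
# in), errors (how many fail), and duration (how long requests take).
from statistics import median

requests = [
    # (status_code, duration_seconds)
    (200, 0.12), (200, 0.09), (500, 0.31), (200, 0.11),
    (200, 0.10), (503, 0.90), (200, 0.13), (200, 0.08),
]
window_seconds = 60  # pretend these all arrived within one minute

rate = len(requests) / window_seconds             # requests per second
errors = sum(1 for s, _ in requests if s >= 500)  # failed requests
error_rate = errors / len(requests)
durations = sorted(d for _, d in requests)
p50 = median(durations)                           # typical latency

print(f"rate={rate:.3f} req/s  errors={error_rate:.1%}  p50={p50*1000:.0f} ms")
```

As Greg says, a rising duration or error rate is the early-warning signal: users feel those long before a server actually falls over.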
So those were kind of those three. And then Brendan Gregg changed that idea, and it's now the USE method: so utilization, saturation and errors. Essentially kind of the same thing, but a little bit different. It's pretty fascinating to read the article that he wrote about it. Okay. It's linked in the show notes as well, right? Definitely. Yeah. It's something that I think a lot of people can get insight from, because it's still a very valid way of measuring application health, before your users start complaining. We actually had this at a company that I was working at, where I had a setup just like that, and we were measuring exactly those metrics, to see, like, as soon as the duration started going up, we knew, okay, some people are hitting certain APIs, whatever, that are starting to get slower. Then you start looking at the back end, and then you can kind of dissect your application and see, okay, which database queries that we're making are horribly inefficient. And then you as a developer can go in and start making those things more efficient, right? And you start whittling away at that, because at a certain point, pretty much any backend system or database or whatever, if it wasn't designed for however many requests you're now getting, which is the whole idea behind utilization. You know, if you see, okay, we're getting into the range of 50,000 instead of 10,000, or, you know, 100 million instead of 50 million or whatever, then you start seeing it buckle under that pressure. You see that and you can take care of it, or you can start, you know, rate limiting. There are a lot of measures you can take to kind of slow that down before your database just says, I'm out, and crashes. Because that's the worst case scenario, right? Then you're serving no customers.
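Brendan Gregg's USE method says: for every resource, check Utilization, Saturation, and Errors. A rough sketch of that per-resource checklist, with invented numbers (on a real host they would come from /proc, performance counters, or a cloud provider's metrics API):

```python
# USE method, sketched: one report per resource, flagging whichever of
# the three signals looks unhealthy.
def use_report(name, utilization, saturation, errors):
    """Flag a resource when any of the three USE signals looks bad."""
    findings = []
    if utilization > 0.9:
        findings.append(f"{name}: utilization {utilization:.0%} (near capacity)")
    if saturation > 0:
        findings.append(f"{name}: saturation {saturation} (work is queueing)")
    if errors > 0:
        findings.append(f"{name}: {errors} errors")
    return findings or [f"{name}: healthy"]

# Invented readings: a busy, queueing CPU; an idle but erroring disk.
for line in (use_report("cpu", 0.95, 12, 0)
             + use_report("disk", 0.40, 0, 3)
             + use_report("network", 0.30, 0, 0)):
    print(line)
```

The difference from RED is the vantage point: RED watches the requests, USE watches the resources serving them, which is why the disk above gets flagged even though it is barely utilized.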
And depending on how badly it crashes, you know, you're also talking about some horrible data loss. I mean, there are all sorts of worst case scenarios. But if you're monitoring that stuff, then you can maybe nip it in the bud before it becomes... Yeah, before it crashes. And then something that they mention as the next step on the way to observability is Prometheus. Yeah, I mean, it's Prometheus and tracing, tracing in particular, right? So with Prometheus, you know, you just have a ton of data that's getting pushed out, or pulled out rather, from the different things. So it's really great for monitoring things like Kubernetes and stuff like that. And you can monitor these really super complex... I mean, when you're talking about a huge Kubernetes instance, you have these things that are kind of ephemeral, right? There are these containers and whatnot that are ephemeral. And so you have to monitor a lot of different statistics, because instead of just, like, request rates and things like that, you're also having to monitor your actual application's health, but over many, many pods, as they call them, right? So you have all these smaller instantiations of this thing that you're running, and so you need a much different perspective. Prometheus provides that, as well as tracing. And tracing is, I think, for me at least, relatively new, especially with more modern stuff. So like with AWS, you have what they call X-Ray. The more open-source standard is now Jaeger. And they're doing really cool stuff, where you can then see, okay, this user wanted to look at this website or wanted to click on this thing, and you can watch that request go all the way through your system and all the way back, and see what call, what specific thing, took the most amount of time. Okay?
Instead of just seeing, okay, you know, that request took forever, you can see specifically from the code what you were calling in that second that took forever, which then obviously helps you to identify where your source problem is much, much faster. And when you talk about stuff like complex distributed serverless systems that are being built on AWS, like stuff we've built here, we have tracing enabled, because you can then see, like, you're tracking something over not just multiple pieces of code but multiple completely disparate services. So in a microservice architecture, you can see which of my microservices is really biting the dust right now, and specifically within that microservice, you know, what is failing. Or better yet, you can also identify inefficiencies. You know, if you see it, you're just like, why is this request just bouncing here and there and back and forth, you know, doing the Pacman thing? That shouldn't be happening. And it gives you just the oversight to be able to do that. And you can even see these really neat maps of the clustering of different functions and things like that, and identify, okay, this cluster right here is experiencing, you know, a higher rate of failure as opposed to the other ones. And then you can dig deeper and deeper into the problem there specifically. Exactly. So it gives you a level of insight that was unparalleled, which is really cool. So that was the Prometheus revolution, as they call it in this article. And then, of course, Grafana is also mentioned. Right. I mean, that's just, you know, your single pane of glass. Yeah, right.
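The tracing workflow Greg describes, a trace ID that follows one request across services, with a timed span recorded for each step, can be shown with a toy example. Everything here is simplified stand-in code: real systems use OpenTelemetry, Jaeger, or AWS X-Ray, which export spans to a collector instead of a Python list, and the "services" below are just sleeps.

```python
# Toy distributed tracing: one trace ID per request, one span per unit
# of work, so the slowest step can be spotted afterwards.
import time
import uuid

spans = []  # a real tracer would export these to a collector

def traced(trace_id, name, fn):
    """Run fn, recording a span (trace, name, duration) for later analysis."""
    start = time.perf_counter()
    result = fn()
    spans.append({"trace": trace_id, "span": name,
                  "ms": (time.perf_counter() - start) * 1000})
    return result

def handle_request():
    trace_id = uuid.uuid4().hex  # follows the request across "services"
    traced(trace_id, "auth-service", lambda: time.sleep(0.01))
    traced(trace_id, "db-query", lambda: time.sleep(0.05))  # the slow step
    traced(trace_id, "render", lambda: time.sleep(0.01))
    return trace_id

handle_request()
slowest = max(spans, key=lambda s: s["ms"])
print(f"slowest span: {slowest['span']} ({slowest['ms']:.0f} ms)")
```

Because every span carries the same trace ID, you can reassemble the whole journey of one request and immediately see which call ate the time, which is exactly the "watch the request go all the way through and back" view.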
Like, you take all this stuff that we've been talking about, you can take, you know, your log aggregation, you can take your metrics, you can take your tracing, all that stuff, you shove it all into Grafana, and you're looking at real-time, non-aggregated metrics, right, that can be displayed in graphical form for human understanding within seconds. I mean, Grafana's graphing ability is unparalleled. It's fascinating. But yeah, it was a revolution, because up until that point, and even still, with most things that you're looking at, most graphs, most visual representations, it's very difficult to put, you know, 10 million points onto a screen. It sounds trivial, but it's not. Yeah, there's a lot of real-time activity going on there that's very difficult. And so that's when you got into the time-series stuff, when you got the ability to do this in real time and zoom in and stuff. And so instead of having that problem, you're able to actually look at the data, not 10 minutes from now, not 15 minutes from now when it's been aggregated or whatever; you're looking at real time, right now, millions of data points in a single graph, or many graphs or whatever, with the ability to build these incredibly complex, wonderful dashboards. Yeah. So that's basically it. I mean, we skipped over some parts; we just hit the big touch points, because we didn't want to dig too deeply. I mean, each of those steps is a discussion on its own, and I'm sure we'll get you back to talk about different aspects of all of this at some point. But the thing that's interesting to me, and I think this is where I struggled to grasp it a bit: it's amazing what you're saying, like being able to see something go right through the system, what functions are being called, and kind of figure out which clusters you need to drill down into. But how do you build that in?
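On why putting 10 million points on a screen is non-trivial: a dashboard panel is only a few hundred pixels wide, so time-series tools downsample on the fly. One common, simple approach is bucketed aggregation: split the time window into N buckets and plot min/mean/max per bucket. This is only a sketch of the idea; Grafana and the time-series databases behind it do this far more cleverly, and the million fake samples below are generated, not real telemetry.

```python
# Bucketed downsampling: a million raw samples reduced to ~one summary
# triple (min, mean, max) per horizontal pixel of a dashboard panel.
def downsample(points, buckets):
    """points: sorted list of (timestamp, value); returns per-bucket summaries."""
    t0, t1 = points[0][0], points[-1][0]
    width = (t1 - t0) / buckets or 1
    out = [[] for _ in range(buckets)]
    for t, v in points:
        i = min(int((t - t0) / width), buckets - 1)
        out[i].append(v)
    return [(min(b), sum(b) / len(b), max(b)) for b in out if b]

raw = [(t, t % 10) for t in range(1_000_000)]  # a million fake samples
summary = downsample(raw, 400)                  # ~one triple per pixel column
print(len(summary), summary[0])
```

Plotting min and max alongside the mean matters: a plain average per bucket would smooth away exactly the short spikes you built the dashboard to catch.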
I mean, I know this is probably a question with a huge answer, but just maybe, is there a way that you could summarize it? Like, how would you build that into your processes? Or is it a big thing, a big conversion in way of thinking? Yeah, absolutely. I mean, you have to kind of do it at the application level, or if not at the application level (like, if you're not able to because you're using somebody else's application or something like that), there are ways to get around that. There are ways to inject it into the software or whatever that you're using. But you basically just have to identify: what are my critical paths, what are my critical business processes that we really need to be aware of? And then you can start, you know, tacking these tools on top of whatever you're running to be able to get that kind of observability. But it's a lot of engineering work. It's a lot of analysis, architecture, engineering, yeah. I mean, architecting a system, depending on the size, of course, but architecting some of these systems will take you, you know, years and years, and you're having to develop your own custom stuff on top of all of it just to be able to get it to where it needs to go. And for each individual company, it's going to be a little bit different. And so in order to get something that really works for you, you have to have people working on this stuff like 24/7. I mean, the team my brother-in-law works in, for example, is like ten or 15 guys, you know. And that's all they do. Yeah, that's it. That's their job. Yeah, that's true.
And yeah, I mean, depending on the size of the company; at the last company I was at, it was the same thing. You just had a huge group of guys taking care of that stuff, besides the application developers and the guys who were maintaining the hardware and stuff that it's running on, you know. It takes a lot of people and orchestration to be able to do that. And if you're able to provide them that level of observability, it's going to make them make better decisions. You're going to have less downtime, which is the goal of everything we do in this industry. Speaking of the goal of less downtime: we've got this observability concept, and then we've got, I mean, classic monitoring, if you want to call it that. What is the difference between the two when you're talking about observability versus monitoring? Yeah, I mean, one of the best analogies that I've heard for it is that it's basically survivorship bias. So when you monitor a system, you're monitoring it for the things that you know you want to monitor. Right. So, you know, your metrics: you're looking at bandwidth, you want to know the speed. You think you know, okay, this is what can go wrong, and so that's what you monitor, right? And there's actually this thing that's become very famous from World War Two, right? The British, when the planes came back, analyzed, you know, where the plane was shot up, and everywhere it wasn't shot up, they would add more armor to the planes, which is counterintuitive. Yeah, because they realized the guys who didn't come home were the ones who got shot in those places that were okay, fine, on these planes, the ones that were still functioning. Yeah, that's survivorship bias.
And so at the end of the day, when you're dealing with really complex systems like we have in the modern day, you have kind of the same problem: if you're just monitoring for the stuff that's already gone wrong, then you're not monitoring for the stuff that you don't know is going to go wrong. So it's all about that matrix of known knowns, known unknowns and so on. And what you're really able to do with observability is monitor for the unknown unknowns: the stuff you don't know is going to happen, and you don't know how it's going to happen. You're just watching for it, because something is going to happen. And that's the thing. That's why you're aggregating so much data, because you're basically saying, give me every piece of data I can possibly have: the metrics that you know are important to be able to analyze, the logs, the traces that are coming out, so that you can immediately identify, okay, something is happening or something happened. And then you can go in and you can make intelligent queries; you can ask questions of your system that you're not able to ask a monitoring system. A monitoring system tells you something. It tells you: Hey, the web server is online. The hard drive is not full. Whatever. It's an answer to a question you've already asked. Exactly. And an observability system, when built properly, is something that you can ask new questions of constantly, right, and be able to analyze and dig through that data to find the answers you need to find, and to find and fix the unknown unknowns. Because typically those are the things that are going to bring down your company. Those are the things that are going to take you offline for hours. Shit that nobody thought would ever happen, which is this confluence of like three or four different things, where it's like, well, if that hadn't happened with that, then we would have been fine.
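The "ask new questions" idea in miniature: if you keep raw, wide events rather than only pre-aggregated answers, you can slice them by any dimension after the fact, including dimensions nobody thought to alert on. The events below are invented sample data; real observability stores hold billions of these.

```python
# Ad-hoc querying over wide events: a question no dashboard was built
# for ("where are the slow failures?") answered after the fact.
from collections import Counter

events = [
    {"service": "checkout", "region": "eu", "status": 500, "ms": 900},
    {"service": "checkout", "region": "us", "status": 200, "ms": 80},
    {"service": "search",   "region": "eu", "status": 200, "ms": 40},
    {"service": "checkout", "region": "eu", "status": 500, "ms": 870},
    {"service": "search",   "region": "us", "status": 200, "ms": 35},
]

slow_failures = Counter(
    (e["service"], e["region"])
    for e in events
    if e["status"] >= 500 and e["ms"] > 500
)
print(slow_failures.most_common(1))  # -> [(('checkout', 'eu'), 2)]
```

A threshold-based monitor can only answer the questions configured in advance; because the raw events are kept, this query, and any other grouping, can be asked for the first time during an incident.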
And if that hadn't happened with that, we would have been fine. But all four of those things happened at the same exact time, which is why the spaceship blew up. You know, these systems are too complex; not being able to ask those questions really hinders you, because then you're being more reactive than proactive. Right. And that's another thing that we always want to be doing in this industry: trying to figure out, okay, how can we sort this out before it hits the customers? Yeah. We're obviously just scratching the surface of observability, but I think that's fine for this introduction episode. But why are you so fascinated by this? I mean, I think I can hear it, but what about it is so fascinating to you? I mean, I've been working as an admin for, I don't know, 15 years. And, you know, monitoring is a big part of it, and I work for a monitoring company, right? And so we've always been talking about monitoring, monitoring, monitoring. I can't even remember when I went to that conference; I think it was 2015, 2016 or something like that, and it was the first time that I had heard the term. And for me it was just this huge eye opener. I was like, Wow, this is super fascinating, because you're taking something that has been around for 15, 20, 25 years, whatever, it's been around forever, you know, just this classic monitoring concept, and you're flipping it on its head and completely trying to figure out a new way to prevent downtime. It's just a very unique twist on it. And I think it's something that a lot of people could probably noodle on and really come up with better solutions, better ideas for how to monitor their systems, how to be more proactive and make their systems observable.
And we've done that with a few of our systems here. Like, I was literally having coffee just 5 minutes ago with one of our other engineers who's running our SaaS product, and they've built in a bunch of observability in the last couple of years. And they recently had kind of this weird one-off problem, and they were finally able to figure it out, because they've built in that observability and they were able to see the thing happen in real time and analyze it. And they're like, that's why it's happening: it's this crazy race condition that nobody had ever really thought of before. And it only happened when we transferred the ownership of one instance to another instance, which happens very seldom in the first place. So it's like trying to catch, you know, something flying by at 100 miles an hour. And if you're just monitoring for stuff, even if you're logging stuff or whatever, it's very difficult to see, because if you have the log from the one system and the log from the other system, everything looks totally fine. It's only when you take those two logs, put them together, and then look at the metrics for that exact timestamp that you say: Oh, because it changed that at the same time the other one was changing that, the database just threw up on us. And you finally figure it out. Otherwise it probably just would have never been solved, right? You know, it's one of those things where it happens every now and again; you just go in there and, you know, change the database entry by hand, which is terrible. But as an operations person, it's just one of the hiccups you've got to deal with, you know? But because we had built that observability into that system, we were finally able to catch it. Yeah.
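The debugging move in Greg's story, two per-system logs that each look fine on their own, merged into one timeline so the coincidence becomes visible, can be sketched simply. All the entries and timestamps below are invented stand-ins for the real systems.

```python
# Merging two already-sorted logs into one timeline by timestamp, so
# that overlapping events from separate systems line up side by side.
import heapq

log_a = [(100, "A", "ownership transfer started"),
         (105, "A", "ownership transfer committed")]
log_b = [(103, "B", "config reload started"),
         (104, "B", "wrote instance record")]

# heapq.merge interleaves sorted inputs without loading everything at once.
timeline = list(heapq.merge(log_a, log_b))
for ts, system, msg in timeline:
    print(ts, system, msg)

# On the merged timeline it's obvious that B wrote mid-transfer: its
# t=103..104 events fall inside A's t=100..105 window. Read separately,
# each log looks perfectly normal.
```

This is a tiny version of what trace-aware observability tooling does automatically: correlate events from disparate systems onto one clock so races like this stop hiding between files.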
So stuff like that I think is just phenomenally fascinating, because I've been dealing with systems that hiccuped for years. Yeah, and it's cool to be able to just kind of twist that around on its head and get actionable, real results just by thinking about a problem differently. Yeah. And obviously, having the ability to retain and store this kind of data and stuff like that, I mean, that's only been possible for the last ten years or so. I mean, you know, you used to have to rotate your logs every couple of days because they were just using up all your disk space and stuff. But disk space has become so cheap; it's only really become possible through technological innovations in other sectors, you know. Being able to even dig through all those logs and parse them out at the kinds of rates that we're talking about nowadays is only possible because of modern processors and stuff. I mean, it wouldn't have been possible 20 years ago, so observability couldn't have existed earlier, you know. And so that's another thing that really fascinates me about the field: just the ability to be able to do that stuff now. And the work that some companies are doing in the observability space, I think, is just really, really neat. I follow a bunch of people like that on Twitter and stuff, and it's fascinating to see something, I think, fairly revolutionary being built before your eyes. Yeah, it's happening right now. Yeah. And it's constantly evolving, you know. Like, when the term first started being used, nobody was really quite sure, and slowly that definition is getting clearer and clearer. And it's all, you know, hard work by industry leaders who figured it out; they're proselytizing, you know: this is the future.
We've got to be doing this, you know, listen up. And it's cool to watch that happening kind of in real time. Yeah, very cool. Well, Greg, it is so fascinating to listen to you talk about observability, and you're going to be back on the podcast in the future. Are you going to be kind of a regular guest that pops up? I'd love to, yeah. So we look forward to hearing from you again. Thank you very much for your time. [Music]

Feedback

Got a question about monitoring? Or a suggestion about what topics we should cover? Or just feedback about the show in general? Get in touch!

By clicking on "Send message", you agree that we are allowed to process your contact information for the sole purpose of responding to your inquiry. The form processing is handled by our podcast hoster, LetsCast.fm. You can find more information on their Privacy page.

★★★★★

Do you like this show?
Give us five stars on Apple Podcasts