AppDev Done Right Summit | Research Spotlight - Day 2 - Operate with Resilience

Bob Laliberte

Principal Analyst

theCUBE Research

TheCUBE’s Paul Nashawaty and Bob Laliberte focus on “Day 2: Operate with Resilience” at the AppDev Done Right Summit, discussing the operational strategies behind long-term application performance. Their conversation centers on observability, incident management and continuous improvement as pillars of modern resilience.

Nashawaty and Laliberte highlight the shift from legacy monitoring to advanced observability platforms, noting that most organizations now use multiple tools to maintain performance. They explore how this toolset supports faster incid...

Paul Nashawaty

>> Welcome to the day two research session of the AppDev Done Right Summit. We are taking a deep dive into the critical operations that keep modern applications running smoothly. So today's focus will be on observability, incident management, and continuous improvement, the key practices that enable teams to detect issues early, respond swiftly and refine processes for ongoing reliability and performance. My name is Paul Nashawaty. I'm the practice lead and principal analyst for the AppDev Practice at theCUBE Research, and I'm your host for today's session, and I'm joined today by Bob to discuss proven strategies and tools that really help elevate operational excellence to drive smarter, faster innovation. Bob, how are you doing today?

Bob Laliberte

>> Doing great, Paul. Thanks for having me on.

Paul Nashawaty

>> Great to have you on. Hey, so why don't you introduce yourself to the crowd?

Bob Laliberte

>> Yeah. Hey everyone. I'm Bob Laliberte, principal analyst here at theCUBE Research covering the networking and observability space.

Paul Nashawaty

>> Awesome. Bob, thanks for joining me today. You and I have had a lot of conversations throughout the years on observability, on the impacts to the market and what's going on in this space. So there's a lot happening. In this session, we're here to discuss the practice of observability, but not just observability. It's about operations and what it means for day two overall in the CI/CD pipeline across the SDLC. Observability in particular provides teams with the visibility that really helps them understand the health and behavior of complex applications in real time. Incident management ensures that when issues arise, there are actionable insights, so they're quickly identified, prioritized, and resolved with minimal impact. And continuous improvement really ties it all together by using those insights from operations to refine the process, strengthen resiliency, and drive ongoing innovation. So Bob, let's just jump right into this because there's a lot going on. We'll start with observability. We look at the organizations that are leveraging these observability tools. What we find in our research, with fresh data that just came out of the field, is that 75% of organizations that responded to our recent survey use six to 15 different observability tools to understand their environment. This is a challenge because 57% of those respondents indicate that they really are in favor of a full stack observability solution. And when we're talking about gaining real-time insights and proactively preventing application issues, 87% of ITDMs consider observability essential to modern application performance and reliability. I know that you just came out of a session. We've all been on the road lately. It seems like forever, but you just came out of a session and there was a lot to talk about with observability. What are your thoughts around these data points?

Bob Laliberte

>> Yeah, absolutely, Paul, and it's really interesting, especially when you're putting it in the context of an application developer. Having covered what used to be the monitoring space for so long, with application performance monitoring and network monitoring and so forth, it's really been these modern application environments, which have transformed from being not only dynamic but ephemeral, that have driven the need for these observability tools. The huge delta there is that, one, it had to be extremely granular. Everything needed to be collected. The idea of sampling was no longer viable because in five minutes, a lot could happen, and forget about five minutes. In a few seconds, a lot could happen. So the birth of these observability tools was really a response to how do we effectively manage these modern application environments and the modern IT environments supporting them to ensure that organizations can have that availability, that performance that they need on a continual basis. And so the observability tool is just the first leg of that stool: being able to actually see what's going on, to be able to recognize it. And the other big piece of observability obviously is you used to monitor for things that you knew about. I want to check on these types of things. For observability, you don't know what you need, you don't know what might be happening. So it's about collecting all of that data from a variety of different sources. You talked about some of those stats with all the different tools. The problem there is that what you need is all the information. You don't need it siloed in separate data puddles. You need to be able to have a consistent view, a consistent, if you will, it's almost trite to say, source of truth that allows you to be able to identify what's going on so that you can then take the appropriate steps to fix that.

Paul Nashawaty

>> Yeah, absolutely. And I love the comment that you made about the one leg of the stool. But that leg has so many different layers. When we look at some of the additional stats on full stack observability adoption to reduce mean time to detect, or MTTD, we're seeing that it's up 68% year over year. That's a big jump because folks, to your point, Bob, are looking for that unified view. And the observability market alone is projected to be $9.3 billion by 2028. That's not too far off, and this is driven by the demand for proactive operations. So Bob, when we talk about this, observability is key. It really is evolving with the maturity within organizations. So some people are immature, some people are very mature, some people are logging, some people are tracing, some people are doing APM or NPM, but there are different elements and that all has to be taken into consideration.

Bob Laliberte

>> Yeah, absolutely. Like I said, it's all about having that comprehensive view across the network. And the other big piece of this is that it's not like the days of the not too distant past where all the applications were securely locked in the castle, everything was in a single data center or just a couple of data centers, and it was heavily protected. Now you've got applications that are distributed across not only multiple data centers, but multiple public clouds and edge locations. And so the number of potential areas where things can go wrong, and that you need visibility into right now, extends to the WAN environment, edge compute environments, et cetera, right? It's no longer confined to just a data center. So again, that realm of what could possibly go wrong really extends the importance of having these observability solutions that can cover that entire environment and provide a consistent, unified view of what's going on.

Paul Nashawaty

>> Yeah, absolutely. When we talk about legs of the stool in this segment, I'm talking about three legs. We talked about observability. Secondly, and it ties into where you're going, we're going to look at incident management. Incident management is really part of that observability view on the operational side of day two operations: what processes and technologies are being used by the organizations that we're talking to, and how are they ensuring effective incident detection, response and resolution across development and operations teams? When we ask this question with these organizations in mind, you and I both talk to the end users as well as the vendors, and there are different ways to look at it, but we find that the average cost of IT downtime is $5,600 per minute, or over $300,000 per hour. That's a lot of money to impact your organization. And we also see that organizations that use incident response automation and SRE practices resolve outages 27% faster. What are your thoughts around this?

Bob Laliberte

>> Yeah, no, I think a lot of things. One, I would start by saying certainly depending on the industry, those downtime numbers could be really low. There are a lot of organizations, especially in peak seasons, where the cost of having any kind of an outage is much higher than that, could be hundreds of thousands of dollars a minute. And that really raises the importance of having the ability to quickly find and fix problems when they occur. And so I think, again, you can't fix it if you can't see it, right? If you don't know what's going on, there's no way you're going to be able to fix it. So it plays a vital role. A lot of these observability tools are able to feed into other tools specifically for applications or networking, et cetera. But I think we're also starting to see, and you brought up that point, that we need to get out of that reactive mode. Yes, it is great that if you have a problem you can quickly find it and be able to fix it. Even more importantly, you want to shift to being more proactive and even, if possible, predictive. So being able to have that information, being able to look at trending information, being able to spot issues is going to be really important for incident management so you can start getting ahead of issues. And that's where I think we'll start to see AI playing a bigger role in trying to make sense of it. Because the reality is, now that we're collecting all this highly granular data across a wide and highly distributed environment, it's getting to the point where there is so much information, it's really beyond human capacity to be able to digest it, correlate it, and come up with what that incident is and what that problem is. And that's why we're seeing so many organizations shifting and leveraging AI technologies to help them go through and perform that for incident management and be able to get to that root cause as quickly as possible.

Paul Nashawaty

>> Yeah, I can't believe we made it this far into the discussion without talking about AI. I mean, it makes a lot of sense, but this is really not new though, right? When we were talking about this several years ago, we were talking about AIOps, right? You and I kind of joke around because I cover that business logic piece and you cover more of the infrastructure side, and the plumbing side of it is more the AIOps side of it, but that's incredibly important. One of the other stats I want to highlight in this section, since we were talking about the unified view, is that 74% of organizations say that centralized incident communication improved resolution coordination across teams. So this is an incredibly important point. To your comment earlier, when you're working with the business logic and when you work with the infrastructure, those two things need to work together. And the AI systems that are looking at it cannot just look at one part of it. They have to look at it holistically, so that the SREs, the DevOps teams and the platform engineering teams all come together and work together in order to achieve the common goal while rowing towards the same North Star. And that's what we have to get to. So I think you're right. I think that's an area that is going to evolve. Again, I think it does depend on maturity, but it's going to evolve based on organizational needs. One of the other things that we talk about in this day two pillar of the research is continuous improvement, right? Actionable insights matter. When we look at how organizations are improving resolution, organizations need to have their teams and their ecosystems incorporate feedback and operational data to improve application performance and delivery continuously. That's part of the CI/CD pipeline.
So what we find is that companies with a post-incident review culture and a continuous learning loop see a 30% reduction in repeat incidents. This is not just coming from theCUBE Research. This is also backed up by the DORA research that we see as well, with the DevOps metrics that we're looking at. We also see that 83% of high-performing teams conduct retrospectives using incident data to improve quality and deployment health. This is an area that is incredibly important because, as you were saying, we need to be more proactive than reactive. We need to be taking action before something happens. And that's exactly to that point, Bob. So what are your thoughts there?

Bob Laliberte

>> Yeah, no, absolutely. I mean, it also comes back... I started thinking about this continuous improvement. You start thinking about Six Sigma efforts and so forth from manufacturing plants. And there were a couple of books not a couple of years ago talking about the comparison of AI to manufacturing. So I love the idea of continuous improvement, and I think it's critical, right? That's how you keep in. The only way to have that incremental improvement is by learning from your past mistakes. And we've seen it not only in app development. I've seen it in a lot of the AI tools. We used to call them closed loop systems, so that you could get feedback. The new term you'll be hearing a lot more about is human-in-the-loop with these AI tools, about how you're able to get and collect feedback. And I think it's critical because as more technology takes over the finding and the fixing, it really needs to leverage the expertise of the users today, right? The application developers, the SREs and so forth, they've built up an inherent knowledge base and all of that needs to be codified. And one of the ways to do that is having this continuous feedback loop, whether it's just within your own entity or you're working with your vendors to be able to provide that feedback. But that's going to be so critical to identifying, when you see something, what the incident was and how it was fixed, and then hopefully the next time being able to avert that or predict that it's going to happen so you can proactively fix it before there's an issue. I think this continuous feedback loop is going to be super critical for that.

Paul Nashawaty

>> Yeah, no doubt. I mean, we are seeing in our research that DevOps teams applying AI and ML to operational data experience 20 to 40% gains in anomaly detection and proactive alerting, exactly to your point. And so the data speaks, right? We can sit here and have a conversation about our opinions on things, but the data is the data, and this is a global study that is fresh out of the field. I'm really excited about it. And Bob, this has been awesome talking to you. I hope that the audience got a lot out of the information here. There's a lot to talk about. The sessions before this covered day zero and day one, and now we're talking about day two research. The client sessions and the customer case studies that you'll see in the next few sessions are really going to amplify some of these discussions. So I encourage the audience to watch more. But thank you, Bob, and thank you to the audience for joining this deep dive discussion on observability, incident management and continuous improvement at the AppDev Done Right Summit. By embracing these practices, organizations and operations teams can respond faster to challenges and build a culture of resiliency as well as ongoing innovation, which is what a lot of developers want to do. They want to innovate. They don't want to be in maintenance mode. But as you're moving forward, remember that the key to operational excellence lies in continuous learning and adaptation. So keep applications reliable, performant, and ready for whatever comes next. Thank you, and I look forward to seeing you in the next sessions. Keep engaged for the remaining parts of the series. Thanks and have a great day.