Cyber Resilience Is Everything with Cohesity
November 19, 2024 | 6:00 PM - 7:00 PM UTC

Explore cyber resilience with James Blake, vice president of cyber resiliency strategy at Cohesity. In this CUBE Conversation, Blake talks with theCUBE's Christophe Bertrand about tackling ransomware, building secure recovery environments, and fostering collaboration across IT, security and vendors. Learn how Cohesity's clean room approach and integration with industry frameworks empower organizations to recover quickly and stay resilient.

Christophe Bertrand
Principal Analyst SiliconANGLE & theCUBE
James Blake
VP Cyber Resiliency Strategy Cohesity
Christophe Bertrand

>> Hello, everyone and welcome to theCUBE. We are going to talk about cyber resilience today. My name is Christophe Bertrand. I'm here for a Cube conversation with a great specialist in incident response who is going to tell us a little bit more about what Cohesity can do to help you when the time comes, but more importantly before the time comes. So James Blake, welcome to theCUBE for this conversation. Why don't you tell us who you are, what you do, and your background, which is fascinating.
James Blake

>> Yeah, my name is James Blake. So my role is I'm the Head of Cyber Resiliency Strategy at Cohesity. So I spend a lot of my time not talking so much about technology, but about the people and process, the operational objectives that we try and get using technology. And then I use that to inform our product strategy, how we deliver messaging around our products, and ultimately make sure I keep the business honest, focusing on the problems customers have rather than just the features of the product. And my background is, prior to joining Cohesity, I ran an incident response practice for Hewlett Packard Enterprise. We built over 91 security operations centers. I ran Global Cyber Risk for JPMorgan Chase, and I'm the ex-CISO of Mimecast. So I've spent about 30 years of my life in cyber security and a lot of it dealing with incidents.
Christophe Bertrand

>> Well, James, I have to say, risk is really your business in many ways. And that's interesting here, because obviously some of our viewers come from the security space and others come from maybe the storage or the backup and recovery space. And when it comes to cyber resilience, it's really a team effort. And I very much like the experience you bring to the table in helping us understand how to make this work, because it is truly a team effort bringing together IT ops, cloud ops, security ops, and a bunch of other folks in the process. So let's talk a little bit about how Cohesity approaches this concept of cyber resilience. And one of the things that's very helpful is to look at some existing models. Actually, maybe we can bring up a quick slide here that you were very kind to share with us. It shows a few of the existing models. There's the NIST model, SANS, and of course your own approach, which we'll talk about in a lot more detail. Tell us, how does Cohesity, in general terms, approach these models? And why did you end up building your own that you called a Clean Room? And we'll explain what that means in the context of IT and security.
James Blake

>> Yeah, I think having been on the other side of the table and being subjected to vendor positioning for years, I was quite keen that, as an organization, we don't do that. We don't create our own framework or workflow. What we did is align to industry ones. And as you said earlier, incident response and cyber resiliency is a team effort, right? With business continuity and disaster recovery, just bringing back the last snapshot brings back the vulnerabilities. It brings back the persistence mechanisms and all the gaps or evasions in your controls. So you don't do that in a cyber incident. You have to investigate and you have to mitigate. And we already have well-structured processes for this, the NIST one and also the SANS six-step incident response life cycle. So when we started to look at what utility Cohesity has and how you would structure it to be used in that context, where we actually discover what the adversary is doing, stop the bleeding through containment, and then actually mitigate the threats, it just made sense to align with those frameworks that most organizations, or at least their security teams, were already familiar with. And then take those IT teams on the journey of actually understanding how they have to transition from the strategies they may have around dealing with flood, fire, earthquake, power loss or equipment failure, to actually doing the right thing for cyber incidents.
Christophe Bertrand

>> Thank you so much for these details. It actually brings up I think another question which is, okay, we have these frameworks, clearly cyber crime is not stopping, ransomware is rampant. Organizations have many challenges. Before we go into the Clean Room component, which I really want to double click on, what would you say the top challenges are that every day you see your customers or prospects are facing?
James Blake

>> Well, I think the first one I already alluded to is the fact that organizations are treating cyber incidents like a traditional BCDR incident. So what's happening is they're seeing the unavailability of systems that provide products and services as a business continuity issue and handing it wholesale to IT, and then sometimes IT aren't even involving security in that process. What they're doing is they just think of the traditional process of taking my last snapshot and putting that back into production. And what we see in those instances are customers that have to recover multiple times, because with ransomware as a service, you've got multiple affiliates that are targeting the same vulnerability. So if you bring a system back without patching it and understanding its vulnerabilities, it just gets re-attacked, or they've left persistence mechanisms in there or evasions of controls. These things just get recovered in the backup. And we see customers, they recover and they get hit again within two minutes. And they're doing this multiple times. They promised the business an RTO of let's say a day, four days, yet it ends up being 15 times four days because they've got failed recoveries and just reinfections again and again. Usually customers stop at that point in time and say, "Let's start investigating the incident." What we're trying to do is get ahead of that and make sure we've got the right environments for customers and the right workflows and processes to do that, to really shorten that cycle and make sure they don't suffer from those problems. So that's the first problem. The second problem is organizations don't consider the dial tone services. They're not in the business impact analysis. And what I mean by that is I've dealt with incidents where people have not been able to make telephone calls, not been able to get physically out of offices because voice over IP is down, because physical access control is down.
These incidents can impact these services that we need to even respond and communicate with law enforcement, insurers, our stakeholders, the press. We need those systems back up. And then, finally, the last problem is a lot of our security products have moved to the edge. We're heavily reliant on endpoint detection and response. What's the first thing we do in all those frameworks we just saw in the last slide? Containment. So all of a sudden, you've created an island. These endpoint security tools can no longer be used for response and the hunting of those activities or even remote forensic imaging. That's another problem that we've got. Not to mention the fact that if we look at the MITRE ATT&CK framework, the way we describe adversaries and their behavior, the tactic with the most techniques under it is defense evasion. If we're a hundred percent reliant on just tools that sit on the endpoint, we can sometimes be evaded, and those evasions are being built into these ransomware-as-a-service platforms. So they're the three biggest issues I see in organizations.
Christophe Bertrand

>> Yeah. And these are fundamental, because I think it really explains why you have to look at it from a very different standpoint. I've been tracking and been a part of many product lines that focused on disaster recovery. And it's true that in many ways disasters are sort of predictable or you kind of know what you're dealing with. And here, you lack all sorts of predictability, and more importantly, you don't really know what hit you and the extent, the scope to which you've been affected. And maybe even you don't even know for how long those people have been observing you. So I think it's time to talk more about your approach with a Clean Room, which of course it's not a clean room in the medical sense or industrial sense that would be an environment devoid of any infection, which is kind of the idea virtually here from an IT standpoint. And really, the idea is to look at this from, again, a very methodical perspective. So let's actually bring up the workflow here. Let's start with the first two components around prepare and initiation. So obviously, it's two components that sort of have the same name, but they're not the same at all. And they are fundamental for the rest to work, which we'll cover in a minute. So James, can you tell us more about these two steps and why are they so important? And by the way, what is that Digital Jump Bag? I'm curious about that.
James Blake

>> Well, the reason why this exists is to solve one of those problems that I described earlier, and that is the fact that organizations typically do not consider the assets and resources that they need to be able to respond to and recover from the incident. So if we look at the BIA that most organizations do, they focus on the most critical business applications that deliver products and services to customers. What they don't consider is, "What do I need to have in place in order to do investigation and the mitigation of threats in those platforms so that I can bring them back securely?" And often, what we found when I used to walk into customers to do incident response is the first two to three days of an incident might be spent trying to find those switches, those workflows, those asset databases, contact lists. All of these things that we need to be able to communicate and collaborate and run the incident are often missing. And the time to get that is not in the middle of an incident. What you can do is prepare these resources and put them inside what we call the Digital Jump Bag, which is effectively an immutable, vaulted storage area with authentication mechanisms that sit outside your traditional business ones, which could be impacted by the incident itself. So I love using a TOTP authentication mechanism on my phone. So even if you scorch the earth and your entire organization has been flattened, you can get this Jump Bag up. And it can be instantiated and mounted on your internal system, and within minutes or hours you can rebuild that trusted environment, which is the initiate stage: get my tooling back to a state inside a confined area where I know it can be trusted, rebuild those email servers, an Active Directory server or some other form of identity management server that you are using to then authenticate those response capabilities, and obviously get physical access.
So it's important to understand that this initiate stage is doing nothing about recovering business services, which are producing products and services. It's all about just what we call the minimum viable response capability: just knowing you can trust your tooling and getting that up into a state where you can communicate with all of the parties that you need to join the incident and handle the incident.
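The out-of-band authentication Blake mentions can be made concrete with a short sketch. TOTP is specified in RFC 6238; the code below is a minimal standard-library illustration of that algorithm, not a description of how Cohesity implements access to the Digital Jump Bag. The function names `totp` and `verify`, the 30-second step, and the one-step drift window are assumptions chosen for the example.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 time-based one-time password using HMAC-SHA1."""
    if at_time is None:
        at_time = time.time()
    # The moving factor is the number of whole time steps since the Unix epoch.
    counter = int(at_time // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks an offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify(secret: bytes, submitted: str, at_time: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept a code from the current step or `window` steps either side (assumed policy)."""
    if at_time is None:
        at_time = time.time()
    for drift in range(-window, window + 1):
        moment = at_time + drift * step
        if moment < 0:
            continue  # no time steps exist before the epoch
        if hmac.compare_digest(totp(secret, moment, step, len(submitted)), submitted):
            return True
    return False
```

The relevant property for the scenario described is that the check depends only on a shared secret and a clock. The responder's phone and the vault can agree on a code even when the corporate directory, single sign-on, and email are unavailable.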
Christophe Bertrand

>> So absolutely, isolation, zero trust, no trust, and the ability to get back on your feet with an infrastructure that you trust, that you believe will allow you to get business processes back in place. Which is interesting, because in reality, you are right. So we used to think it's just the data that can get corrupted or deleted. Well, it's getting leaked, that's another issue, but it's also the infrastructure. If you can't access or log in to your business applications because your network or your identification tools have been affected, then there's no business. So I think it's very important to be comprehensive, and I think what you've just explained here demonstrates why it is that you really need to have this collaboration between security teams and infrastructure teams across the board. It's truly a team sport. So let's talk about the last couple of stages, which is really where the solution sort of kicks in. So I'm going to bring that up real quick here, and what you're going to see is that of course once you're back on track, you are going to start looking into what happened and then hopefully recover. So tell us briefly about those two stages, and then again tell us about the migration piece. Because you have to sort of get back into production state. You can't just run from an isolated environment in production for a long period of time.
James Blake

>> Yeah. And so, the important thing to remember is that initiate stage, it's only about getting the response capability, security tooling, the identities needed for response and recovery, not the identities of the whole company. So what you might find is what we used to call, when I was delivering operational resiliency for a large bank, resiliency category one apps. They are the most critical apps that we needed to run the business. But underneath that there's almost resiliency category zero, which are the things that security and IT need to manage the incident. So we've dealt with those at initiate, but now what we're starting to do is work on that critical business application. So this is when we actually use the native capability of a data management platform like Cohesity to help overcome some of those problems that containment introduces. So when I talked about the fact that now your endpoint security solutions are isolated because we disconnected the networks or the hosts to prevent spread of the infection, well, we can still conduct forensics on the file systems, which are within the data management solution. In fact, we can add time travel across the time of the incident, because it's not like traditional forensics where we only see a file system at the end. We can actually now see a file system across the entire timeline of the incident, which really empowers analysts. And I think it's an untapped resource that a lot of people in the security operations center don't know exists. But then, you've got things like threat hunting. If you hunt for IOCs and you're relying on endpoint security controls, well, we know they can be evaded. And there's 43 techniques in defense evasion in MITRE ATT&CK, more than in any other stage. And people are baking things like EDR killers into their ransomware-as-a-service platforms. But guess what? If you are using that offline data that exists within a data management solution, you can't be evaded using those traditional techniques. And we still work off the containment. So actually, those problems that are introduced that I mentioned in the last section are actually solved by using a data management solution to do that.
But it's a team sport, you mentioned that, not just between IT and security, but between security vendors as well. And that's one of the reasons Cohesity built the Data Security Alliance with leading vendors like Splunk and Cisco and Palo Alto and Zscaler, just to name a few, because we work with those solutions. So when you go into containment, some of those solutions no longer have access to live systems, but we can provide that context of the data to them even in containment. So we work on providing these native capabilities to speed investigation and also, through our integrations with third parties, to make sure that you truly understand how they got in, how they maintain persistence, how they've evaded controls, so that you can actually then move to the next stage, which is mitigate. So typically, the investigation environment is owned by security. And now what we do is we go into a mitigation stage. And there are two strategies to mitigation. You can rebuild systems, and this is where you can hold the golden masters of those systems and trusted configurations in the Digital Jump Bag, so we are literally rebuilding a system. Or you can recover a system and then clean it. So you do a volume-level backup, you're able to recover that system, and then, with the information you learned from the investigation stage, clean those artifacts off there. So we support both approaches. And you might actually choose to support both, and then on a case-by-case basis after the incident go, "This system's only got three artifacts on it. I'm just going to clean it." Or you go, "I've got 17 tasks to make this system ready for production again. What I'm actually going to do with this one is rebuild it because the level of effort is lower." It's all about squeezing that response timeline down.
And then, finally, you need to test those mitigations, patching things. All those mitigations might introduce functional or performance problems, so typically we recommend customers use a dev environment as their mitigation environment, and then what they do is snapshot that entire environment and then lift that back into production. So we actually replicate the production environment in the mitigate stage, and we do that by holding the hypervisor or network configurations in the Jump Bag so that we're able to, for each set of workloads that we're bringing in, configure that environment and snapshot it, so if you didn't catch everything, you don't have to go back to square one. You can just restore that snapshot and do further investigation and mitigation. And it's a pipeline. Once you've got to mitigation with one set of workloads, you've already got your next level coming in through the investigation stage. And this is really the process we do in incident response. It's just normally, as incident responders, we charge customers an awful lot of money to do it.
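The idea of hunting across the whole timeline of backups, rather than on a single contained endpoint, can be sketched in a few lines. This is an illustrative model only and does not use any Cohesity API: the `Snapshot` class, the `first_seen` and `last_without_known_iocs` functions, and the representation of a snapshot as a mapping from file path to SHA-256 digest are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Snapshot:
    """Hypothetical view of one backup: when it was taken and a hash per file."""
    taken_at: datetime
    file_hashes: Dict[str, str]  # file path -> SHA-256 hex digest


def first_seen(snapshots: List[Snapshot],
               ioc_hashes: Set[str]) -> Dict[str, Tuple[datetime, str]]:
    """For each known-bad hash, report the earliest snapshot and path where it appears."""
    hits: Dict[str, Tuple[datetime, str]] = {}
    for snap in sorted(snapshots, key=lambda s: s.taken_at):
        for path, digest in snap.file_hashes.items():
            if digest in ioc_hashes and digest not in hits:
                hits[digest] = (snap.taken_at, path)
    return hits


def last_without_known_iocs(snapshots: List[Snapshot],
                            ioc_hashes: Set[str]) -> Optional[Snapshot]:
    """Return the newest snapshot taken before any known indicator shows up.

    This only means "no known IOC present". It says nothing about unpatched
    vulnerabilities, which is why the transcript stresses investigation and
    mitigation before anything returns to production.
    """
    candidate: Optional[Snapshot] = None
    for snap in sorted(snapshots, key=lambda s: s.taken_at):
        if ioc_hashes & set(snap.file_hashes.values()):
            break
        candidate = snap
    return candidate
```

Because the scan runs over stored snapshot metadata instead of a live host, it does not depend on an endpoint agent that may have been disabled or isolated, which is the point Blake makes about containment.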
Christophe Bertrand

>> Wow. So really, the takeaway here for me and for our viewers I hope is this is a very different world. I think it's going to fundamentally change how you architect your teams. It's going to foster better collaboration across multiple teams and you also mentioned the ecosystem of vendors. I think that's actually very critical. The people piece of it, which we won't have time to discuss, is probably as critical as the process and the technology. So we've covered a lot of ground here, James. I'd like to thank you so much for your time. I think there's a great solution here. It definitely has a ton of different components that set it apart. I believe that it is much needed in a market that is seeing a ton of attacks. It's not going to go away. And it's great to see how Cohesity is bringing to the table people like you and really trying to bridge the gap between, well, what used to be different worlds siloed maybe a little too much, and this new world, which is not necessarily great, but at least one in which we have options to be able to recover and be more resilient. And remember that every time you invest in making your infrastructure, your environment more resilient, it's an opportunity to improve the business as well. It goes straight to the bottom line and probably the top line too, because you can still transact. So James, thank you so much for joining us today. And to our viewers, thank you for your time. Bye.