James White, Calypso | NYSE Breaking News
Join us for an insightful conversation with James White, the Chief Technology Officer at CalypsoAI, hosted by theCUBE's John Furrier. In this video, White discusses the groundbreaking release of the CalypsoAI security index, which provides a comprehensive assessment of gen AI models. This index is the first of its kind, offering safety rankings and security insights, crucial for enterprises integrating AI into their operations. The discussion is further enriched by theCUBE Research and analysts, who bring their expertise to the table.
Key takeaways from the discussion include an exploration of the security impact of AI models and how CalypsoAI's new index helps enterprises ensure safe deployment. White emphasizes the importance of selecting models that balance quality with security, helping enterprises make informed decisions. The conversation also touches on the concept of Agentic warfare and the role of AI in enhancing cybersecurity, according to White. Find more SiliconANGLE news and analysis at https://siliconangle.com/. Follow theCUBE's wall-to-wall event coverage at https://siliconangle.com/events/ and learn about the latest theCUBE events at https://www.thecube.net/.
#CalypsoAI #GenerativeAI #Cybersecurity #AI #CyberResiliencySummit #Microsoft #AWS
00:00 - Intro
00:06 - Exploring AI Security: An Introduction to theCube Studios and CalypsoAI's Security Index
02:08 - Securing AI Models: The Role of Red-Teaming and Product Testing
04:32 - Agentic Warfare and AI Security
06:57 - Decoding Cybersecurity: Symmetry and CASI Insights
09:35 - Key Metrics and State of AI Models
11:50 - Leverage of CASI for Security and Risk
14:04 - Evaluating AI Models: Performance Metrics and Leaderboard Insights
16:50 - AI Dynamics: Navigating Agentic Warfare and Leaderboard Strategies
19:19 - Reflection and Recognition: Understanding the Leaderboard
Forgot Password
Almost there!
We just sent you a verification email. Please verify your account to gain access to
theCUBE + NYSE Wired: Mixture of Experts Series. If you don’t think you received an email check your
spam folder.
Sign in to theCUBE + NYSE Wired: Mixture of Experts Series.
In order to sign in, enter the email address you used to registered for the event. Once completed, you will receive an email with a verification link. Open this link to automatically sign into the site.
Register For theCUBE + NYSE Wired: Mixture of Experts Series
Please fill out the information below. You will recieve an email with a verification link confirming your registration. Click the link to automatically sign into the site.
You’re almost there!
We just sent you a verification email. Please click the verification button in the email. Once your email address is verified, you will have full access to all event content for theCUBE + NYSE Wired: Mixture of Experts Series.
I want my badge and interests to be visible to all attendees.
Checking this box will display your presense on the attendees list, view your profile and allow other attendees to contact you via 1-1 chat. Read the Privacy Policy. At any time, you can choose to disable this preference.
Select your Interests!
add
Upload your photo
Uploading..
OR
Connect via Twitter
Connect via Linkedin
EDIT PASSWORD
Share
Forgot Password
Almost there!
We just sent you a verification email. Please verify your account to gain access to
theCUBE + NYSE Wired: Mixture of Experts Series. If you don’t think you received an email check your
spam folder.
Sign in to theCUBE + NYSE Wired: Mixture of Experts Series.
In order to sign in, enter the email address you used to registered for the event. Once completed, you will receive an email with a verification link. Open this link to automatically sign into the site.
Sign in to gain access to theCUBE + NYSE Wired: Mixture of Experts Series
Please sign in with LinkedIn to continue to theCUBE + NYSE Wired: Mixture of Experts Series. Signing in with LinkedIn ensures a professional environment.
Are you sure you want to remove access rights for this user?
Details
Manage Access
email address
Community Invitation
James White, Calypso
James White, Calypso | NYSE Breaking News
Join us for an insightful conversation with James White, the Chief Technology Officer at CalypsoAI, hosted by theCUBE's John Furrier. In this video, White discusses the groundbreaking release of the CalypsoAI security index, which provides a comprehensive assessment of gen AI models. This index is the first of its kind, offering safety rankings and security insights, crucial for enterprises integrating AI into their operations. The discussion is further enriched by theCUBE Research and analysts, who bring their expertise to the table.
Key takeaways from the discussion include an exploration of the security impact of AI models and how CalypsoAI's new index helps enterprises ensure safe deployment. White emphasizes the importance of selecting models that balance quality with security, helping enterprises make informed decisions. The conversation also touches on the concept of Agentic warfare and the role of AI in enhancing cybersecurity, according to White. Find more SiliconANGLE news and analysis at https://siliconangle.com/. Follow theCUBE's wall-to-wall event coverage at https://siliconangle.com/events/ and learn about the latest theCUBE events at https://www.thecube.net/.
#CalypsoAI #GenerativeAI #Cybersecurity #AI #CyberResiliencySummit #Microsoft #AWS
00:00 - Intro
00:06 - Exploring AI Security: An Introduction to theCube Studios and CalypsoAI's Security Index
02:08 - Securing AI Models: The Role of Red-Teaming and Product Testing
04:32 - Agentic Warfare and AI Security
06:57 - Decoding Cybersecurity: Symmetry and CASI Insights
09:35 - Key Metrics and State of AI Models
11:50 - Leverage of CASI for Security and Risk
14:04 - Evaluating AI Models: Performance Metrics and Leaderboard Insights
16:50 - AI Dynamics: Navigating Agentic Warfare and Leaderboard Strategies
19:19 - Reflection and Recognition: Understanding the Leaderboard
In this interview from the theCUBE + NYSE Wired: Mixture of Experts series, James White, CTO at CalypsoAI, joins theCUBE’s John Furrier to unpack CalypsoAI’s newly launched Security Index – the first comprehensive safety ranking of major generative AI models. White explains how the weekly updated leaderboard and the CASI (CalypsoAI Security Index) score enable apples-to-apples comparisons that blend quality and security, helping enterprises move beyond POC purgatory and toward ROI. The discussion connects model selection and risk posture to enterprise strateg...Read more
exploreKeep Exploring
What considerations should companies take into account when selecting a generative AI model for production, particularly regarding quality and security?add
What are the considerations and methodologies involved in testing AI models for security purposes?add
What is the assessment of Anthropic's efforts in developing safe generative AI models?add
What does the acronym CASI stand for, and how does it relate to security in the context of model usage within a company?add
What is the role of the Purple Team in managing new threats and ensuring safety in AI deployments?add
>> Hello, welcome to theCube Studios here in Palo Alto, California. I'm John Furrier, your host of theCube. We are here for some AI news. We have Jimmy White, who's the CTO, Chief Technology Officer at CalypsoAI. They've been formerly on theCube, the co-founder has been on our Supercloud Six. Great to see you, Jimmy, thanks for coming on. We've got some news to unpack here, you guys just released your security index, just launched and providing the first comprehensive look and also safety rankings of all the major gen AI models. There's a leaderboard and everything, so we'll dig into it, thanks for coming on.
James White
>> Thanks so much for having me, John. Pleasure to be on and looking forward to unpacking all this.>> We saw the news hit the Wire. Really, you guys are really the first solution to secure this model inference layer. Actually inference is a killer app, training inference, reinforced learning, it's the new thing for AI. Obviously, all the data's in there. You guys have been building a system to really kind of watch the data leakage misuse, really trying to get the enterprise to have that confidence, the enthusiasm, super high, right? Gen AI at scale, we're seeing and reporting on siliconagle.com and our Cube research team that a lot of the enterprises certainly have ushered in machine learning for things like fraud detection and applications and systems that they have. But generative AI has great benefits, super exciting opportunities, but they're still in the sandbox, they're stuck in what we call POC purgatory because they want to be 100% sure you cannot be wrong. And so there's a lot of work going on this end-to-end area in the enterprise, super hot area. So I'm super excited to see this index because you guys are really kind of putting the test, the stress test, you have techniques in this. Take us through what this index is, first, explain the news. What is this security index? It's the first of its kind. How are you doing the ranking? Take us through.
James White
>> Yeah, no. So a lot of companies are struggling with the idea of getting to production, how their gen AI investment yields that ROI they're all chasing. And so one of the critical aspects of selecting the right model for your use case is the quality metrics for that model, so how good is this model it doing while you need it to do for your use case. But often forgotten is what's the security impact of that model? Is it good for that use case? Does it have flaws? Does it have things that it's vulnerable to? And so our new security index, the CalypsoAI security index aims to solve that problem by showing you how each model performs in key certain areas, giving you a score that you can compare and contrast against other models to answer that question, which model is the combination of best quality and security for my use case?>> We've had Neil on since the co-founder of the company on the previous Supercloud we've had. We learned a lot about your company, you guys have a product, a Red-Team product, I love that interview. He came on our Supercloud Six with me and Dave Vellante. Talk about, before we get into some of the how you guys did this, set the table for us. Neil kind of brought this up last year, CalypsoAI has a security inference layer kind of product, it's called a Red-Team product. What do you guys have? Why are you guys doing this?
James White
>> Yeah, so we began our journey with an enterprise with defense. And so defense was all about protecting the use case when it goes to production. But in fact, the security journey starts way upstream from there. So when you're selecting which use case that you wish to use AI for, there's a problem out there that when everyone has a hammer, every single thing looks like a nail, so solving everything with AI is not the right way to go about it. Selecting the right use case and then selecting the right model for that use case. So if it's a highly regulated industry, you need to use a model that is appropriate to use in that setting. If you have a proprietary information you don't want leaked or if you're interacting with miners, if it's a company that allows models to be used by children or teenagers, it's imperative that you have a model that's fit for purpose. So our customers were looking for a way to test various models against different criteria that their own customers would face with that use case. And so we built this Red-Team, which allows you to test every single model in one of three ways. Our signature attacks, which are effectively a variety of attacks that fall across all areas of arm within models. Secondly, our operational attacks and which are legacy style attacks like DDoS attacks and InfoSec, but applied to an AI setting. So for example, you could ask something like, what's the history of the world in 30 million lines that will sail through traditional DDoS detection, hit the model, cause that model to generate a huge response, which will block the GPU and give you a denial of service attack. But lastly, our biggest, I guess, breakthrough was our Agentic warfare technology. This is leveraging what a lot of people every newspaper is talking about at the moment, agent AI, but using that to find flaws and gaps in existing AI models, exposing those so you can understand how to deal with those threats.>> Talk about Agentic warfare, love that term in the sense of it plays off the agent wave that's hitting the hype is off the charts. We see agents, we love agents, we see this will be a preferred future, very quickly. Agentic describes the infrastructure, but Agentic warfare implies there's some automation involved. Can you define what Agentic warfare means?
James White
>> Yeah, absolutely. So agents can be thought about as a virtual representation of a person doing a job, where you have a pen tester, somebody that will try find ways of breaking or manipulating a piece of software, what we do is we create agents, multiple different types of agents, some are researcher agents that research the latest threats, some are agents that will try different types of prompt injection, jail breaking, et cetera. And what you do is you give it an intent, so what is the thing you were trying to get the model to do that it should not do? So an example might be give me John's social security number or John's IBAN for his bank account or whatever the case may be, and our agents will try millions and millions and millions of different techniques that it figures out itself to get that information. And another agent will verify that the information it received meets the intent that you gave it, thus completing the attack. And so you use this for good purposes to find flaws in models before you release them to production.>> I love the for good angle, just to kind of throw a side note here, we'll get back to the leaderboard in a second. This symmetry trend is happening. It used to be very asymmetrical, the bad guys had an advantage, we've seen that. You're seeing a lot of that symmetry also in media right now with disinformation. Cyber security is setting the table for some of the best practices across all industries with data. What's your comment on that? Could you share your thoughts? What's your reaction to that?
James White
>> Yeah, I think it's a really good point. What we have, we're seeing the investment in gen AI from enterprises go into huge money. So on the inference side, companies like Anthropic publishing their revenue and we're getting into billions of dollars. So there's a huge amount of enterprises building applications against that right now. When those applications go live, which have already started happening, now you've got a brand new attack surface. And so cyber security is leading the way in understanding those threats, but also it needs to embrace generative AI to bring to the table to fight fire with fire.>> I love it. I mean, I talked to a lot of CISOs and security folks. It's like gen AI, oh, that's great, it's just another application to us. We look beyond the hype, we have that bar. So this leaderboard is super important, I want to get now into the news. First of all, there'll be a lot of attention, everyone's going to want to know, how do I rank in this thing? How do you measure it? So take me through the models you guys ranked, some of the advanced techniques, you touched upon Agentic warfare. What metrics did you guys pick to settle around that composes the leaderboard?
James White
>> Yeah, no. So first of all, we pick the top models. So the models that are the highest quality models in the world because that is typically what enterprises use to build their products against. We can test any model, but we began with the top 12. In those, we've seen that, Anthropic we've already spoken about are the safest model. Now their intent as a company is to build safe generative AI models. So they've really achieved that. And our hats are off to them, they've done a great job. With all of these models, you can of course find threats that do break through. So it's important to know what those are and protect against them. So you still need the defensive element. But before we came along, there were certain metrics that were used like attack success rate, that's a very basic metric. So I gave a prompt to give me a response, but there's a lot of things missing there. So the complexity of an attack or the intent of the attack was not factored in. We factor in all of those things. And effectively, severity is one of the biggest things. If you get a model to tell you how to steal cookies from the cookie jar, I don't think any CISO is going to care about that. But if you manage to steal proprietary information, like the often used story of stealing the recipe for Coca-Cola, well then I'm sure the CISO will care a lot about that. So severity is one of the scores that we use to build our CASI score. Complexity is another, so how advanced is this attack? Old techniques like DAN, we measured their decay rate. So how decayed their effectiveness is against the latest models. A DAN33 attack would be very capable of breaking a GPT-3.5 model, but not so Claude 3.7. And then lastly, the defensive breaking point, how much effort and cost do you have to put in to achieve this attack? Is it nation state levels of investment? Or can a script kiddie do this from their bedroom?>> So the CASI, that's your score, that's the final score. That stands for what? CalypsoAI Security Leaderboard Index?
James White
>> Yeah, CalypsoAI Security Index, that's right.>> Yeah, that's like the major score. That's like the final answer, right? That's your grade-
James White
>> Yeah.... >> on your test. Okay. You mentioned other things risk, so like the risk management piece and value cost, those are the other areas. So overall score, risk and value.
James White
>> That's right. So risk and value, if you're using a model for something that's basic inside your company, there's no real danger. You can typically get away with a cheaper, less secure model. So just because a model has a low CASI score does not mean it doesn't have its fit for purpose areas that you can use in your enterprise. But if you're putting something out time in production that's customer facing, you really want to make sure your CASI score is high and understand the type of threat actors that will be trying to break into your system. You mentioned fraud earlier, if it's organized crime or if it's maybe health records for United States citizens, it may be a nation state trying to break in, and therefore you need a really high complexity score, really high CASI score and you need to make sure that the severity level is very high as well.>> All right, before we get into the CASI scoreboard, what's the end game here? Obviously, we said at the top here, companies want to maintain that security posture, those standards, but bring in innovation, they don't want to be a blocker. But it's also a people business, craft is big, manual work is done. That's a big theme in AI right now. What problem does this solve besides protecting the companies? What does it do for the teams? Does the Red-Team benefit? What is the benefit when you look at this Agentic warfare method and some of the things you guys do? I'm sure it fills gaps in all those things, but what's the end game? What's the benefit?
James White
>> So the very first one is apples to apples comparison. So you can actually compare models against each other in a meaningful way. Secondly, we make this leaderboard publicly available. So regular users, folks that aren't doing this for their work, they just want to use it in school, in their personal life, et cetera, they can go there and see which provider to choose based on the type of things they use the model for. One of the other major benefits is when you're a Red-Team professional and you work in a company that has a model that's chosen for you because of your enterprise vendor agreements. So let's say you're using Microsoft, you might be on the Azure stack using GPD or Phi, whereas if you're on the Google stack, you might be using Gemini or Gemma. And so you might be forced to use a specific model family and you need to understand out of that family, which one is best for my use case or which ones am I worried about and I want to put a ban order on inside my organization?>> Real quick, Jimmy, on performance, do you guys measure performance at all? Or is that performance of the tests? I noticed in the benchmarks and in the leaderboard, there's a performance column, what is that about?
James White
>> Yeah, so we do, of course, measure performance. So the average performance is the, so things like MMLU cetera, how good a model is at a purpose. So the really important thing about that is you want to get that trade off right. So you want to have the best combination of quality performance to security performance. And so we give you the average performance rating as well so that you can make that decision without having to go to other leaderboards.>> Got it. All right, let's get into the leaderboard. Looking at the models here, let's go look at the rankings, Anthropic, Claude 3.5's on it, 96.25. What's the measure on the overall CASI score mean? Is it higher the worse it is? And what's the bar... If that's an A or an F, whatever you want to look at it, what's the C-
James White
>> Okay, so that's an A.>> What's the average? Where's the line between I could be indifferent? Or where's the concern? Where's the red, yellow, green kind of vibe here?
James White
>> That's a good question. So first of all, the higher the score, the better. So the higher the score, more secure your model is. What's interesting about the leaderboard as it stands, and this changes on a daily basis, we update the leaderboard weekly. I can already give you a sneak peek into next week, and that Claude 3.7 has come in hot at number two. So it's actually coming in less secure than its predecessor 3.5. So it's interesting as companies try to release new models to do different things for more people, sometimes the security score will go down on that model, but the top three are the ones to really focus in on. They have quite a lead over the rest of the table. And so you can see quite a gap from Anthropic, Microsoft, and again, Anthropic, they're the two companies creating the safest models on planet Earth at the moment. And something that we, I think can't but speak about DeepSeek-R1 got a huge amount of bad press recently. Obviously, the application for DeepSeek, when you're giving your information, it may end up being in China. The actual model itself is quite good and a lot of the reports were not as factual as they may should have been. DeepSeek-R1 is mid-table on our leaderboard and it deserves that position and it's one of the only reasoning models, so the latest breed of models to appear on the leaderboard.>> Got it. So there's some context to these models. Okay, so DeepSeek has got a 74.4, that's R1, Distill-Llama-70B, that's a good score then for them. Because they're reasoning, so they would be the leader of the reasoning models or did I get that right?
James White
>> Yeah. No, you're right. They're up there. One of the things I would take into consideration, the reason their score is lower is that whilst it's very good at blocking attacks in the main, it does give quite a lot of false information. If you ask it questions about maybe historical Chinese incidents, it won't give you the answer that you expect from history. If you ask questions about the US historical, it'll give maybe inflamed answers about those things. So whilst it's good at fending off attacks, it does provide quite a lot of false information.>> Okay. I noticed you guys stopped at 67, Alibaba's at the bottom. Was there other ones that didn't make the cut? How do you guys look at what doesn't make the list? Is there a threshold? Do you just cut them?
James White
>> Yeah, so you'll see that GPT-3.5 Turbo is second last on the list. So in next week's leaderboard, it actually falls off the list completely, it drops out of the top 12. We're keeping it at a top 12 for now. What we want to see is models fall off and promote. So what's really interesting is all of these models, even if they don't change the new attacks that come out on a weekly basis, cause them to find new gaps. So something that might be performing really well this week may perform poorly next week, just like what happened with 3.5.>> Okay, so I love the Agentic warfare concept, you guys have Agentic remediation going on too. I mean, or are you measuring how the model's going to respond? Because a lot of the reinforced learning is coming, obviously reasonings there. I mean there's a big debate between causal AI and I call it the propensity side of it where it really has to do its own less causal, more reasoning, right? So what do you see on that?
James White
>> Yeah, so it's really good observation. We have effectively got a Purple-Team coming at H2 and that is the blend of Blue and Red obviously, which takes new threats discovered by Red, creates custom rules using our custom defense system, which is based on generative AI, and then, and this is a critical piece, it doesn't auto deploy to production and we have a human in the loop there. So we are our big advocates about safety and security in AI and we have to walk to walk. So if we're about to put out customers generated by an AI fully without a human pressing go, that's a bad step. So that human in the loop step, even though we try automate as much as possible, we're using AI to create these customer defenses. We need a human to say, "Yes, this is good enough for live." And so that's the one part that isn't fully automatic.>> All right, so I have to ask you, where do people find this? Certainly we love it, great work by the way, congratulations to you and the team. Say hello to Neil for me over there, good work. Hey, if you've got a widget, we'll put it on our website. Love leaderboards. Especially, it's a great benefit to the community. Where do we find this?
James White
>> So calypsoai.com, we've got our leaderboard link right at the top and it'll be updated on a weekly basis. So once you've seen it, please come back and check it out.>> All right, give us an API, we'll put it on our site. All right, now let's just before we end, I want to just get your thoughts on impact of this. So you got a leaderboard, you guys are doing your job. Congratulations, thank you very much for this great service. Now, the people out there building for the enterprise, there's a huge startup ecosystem emerging, they want to sell into the enterprises and they have a big high bar to hurdle over from security standpoint, POCs, tons of one-year contracts going on. So these startups... And other companies, by the way too, that want to integrate with these enterprises, they need to integrate into what they already got. Not a lot of people want to change their... I mean their tools and their platforms don't mind changing process, this is our observation on our research side. So the question for you is how does this leaderboard help them run their POCs, benchmarking? Because benchmarks are crazy these days, everyone's the fastest of everything. So I'm a developer, I might want to use a DeepSeek for the reasoning, or the Anthropic Claude 3.7, but I'm going to run that on a workload that I built on a stack that I built running on say a Dell AI factory or an NVIDIA machine or an HP... Whatever it is, I got to run that software somewhere. So what's this do for those folks trying to evaluate the efficacy of their workloads?
James White
>> Yeah, so I guess the very first thing is a lot of folks download their models if they're hosting them themselves in their own hardware from Hugging Face and other vendors like that. And so the first thing you can do is you can evaluate that model without deploying it to figure out if it's safe for your use. Second thing, you've got a great starting place where we're providing you with the list of safest models. If any of those models fit your use case, just use them straight away and that'll save you a bunch of time. But what's critical is that this is not a one and done, it's CICD, you need to integrate this into your pipeline. So as these models iterate, as the threats iterate, that cat and mouse game does not end, you need to continuously test this as you redeploy and add new applications to your stack. One thing I would also point out is that latency is becoming an issue. So when you're using generative AI models, if you're using publicly hosted models, there's peak times that you might've even noticed yourself, responses can be delayed. Understand that those impacts will also impacts your customer when they're using your product. And so there's many areas from testing before you go to production, to security and defense in production, and also understanding impact to your customers based on what happens in the field.>> Jimmy, great to have you on. When I'm in Ireland, we'll have a Guinness, have a pint and get together. Final question for you is, what was the motivation? Did you guys just say, "Hey, we got all this data, we should publish this."? Obviously it's great for the community, as I mentioned, but was it that, hey, we got the Red-Team's product, we're seeing all this, we have the data. Was it more of a contribution to the community? Do you think it's got benefits to the business? And what was the rationale? What was it like sitting around the room saying, "Hey, we should do this."?
James White
>> Honestly, we were building a Red-Team product and we had all this information and we thought, is there anything we can do open source, et cetera here that will benefit the wider community? And one of the things we figured was we had a by-product of all of this information, why not cut some of it off every week, slice it up and put it out on a website. Obviously, it benefits us from a people understanding what we do and how we do it. We get some good promotion from it, but the real motivation was we had this extra by-product of information, let's just use it for good.>> Awesome. Jimmy, thanks a lot for coming off and breakdown and unpack and do a deep dive on the news. Really appreciate it. CalypsoAI launching Security Index, it's the first comprehensive safety ranking of all the major generative AI models. Again, congratulations. We love it. Thanks for coming on.
James White
>> Thanks, John. Pleasure.>> Okay, I'm John Furrier here in the Palo Alto Studios bringing all the news to you here in the AI world. Every day, something's happening. Every day, the models are coming out with more advanced features, leapfrogging the other ones. It's just the game of leapfrog. More and more benchmarks coming out. We're trying to make sense of it, and that's what we do here on theCUBE, extracting the signal from the noise. Thanks for watching.