In this interview from the Nvidia GTC AI Conference and Expo, Betsy Chernoff, principal AI and product marketing manager at WEKA, joins Ace Stryker, director of AI marketing and ecosystem at Solidigm, to talk with theCUBE + NYSE Wired's Gemma Allen about why exploding context memory demands are creating an entirely new tier of storage in AI infrastructure. Stryker explains how Nvidia's CMX platform reflects a fundamental shift — storage has moved beyond feeding GPUs and housing shared data to now hosting dedicated nodes for petabytes of KV cache. Chernoff highlights how WEKA's augmented memory grid was built for persistent KV cache storage, complementing both STX and CMX across current and next-generation Nvidia platforms including Vera Rubin and BlueField-4.
The conversation also explores the practical economics of keeping GPUs running at full utilization. Stryker points to MLPerf storage benchmark data showing a direct one-to-one correlation between storage bandwidth and the number of GPUs a system can keep busy, noting that real-world utilization often falls between 50% and 80% — significant untapped capacity in a supply-constrained environment. Chernoff shares results from a production-grade proof of concept with Firmus that delivered a 6x improvement in tokens per second using WEKA's augmented memory grid, demonstrating how persistent KV cache storage translates directly into throughput gains. The discussion also touches on the global NAND shortage and the evolving shape of buyer personas as AI clouds, model builders and enterprises converge around similar infrastructure challenges. From squeezing more cycles out of landed GPU capacity to building software-defined platforms that bridge today's hardware to tomorrow's AI factories, both guests provide a practical roadmap for navigating the memory-constrained AI era.
Betsy Chernoff, WEKA, & Ace Stryker, Solidigm
Gemma Allen
>> Welcome back to theCUBE here live on the ground in San Jose. It's Nvidia GTC 2026. And I'm here at the WEKA and Solidigm popup where the food is great and the company is even better. Joining me now is Betsy Chernoff, Principal AI and Product Marketing Manager at WEKA and Ace Stryker, Director of AI Marketing and Ecosystem at Solidigm. Welcome guys.
Betsy Chernoff
>> Thank you.
Ace Stryker
>> Thank you. Great to be here.
Gemma Allen
>> So much happened yesterday in a day. I mean, there's been so much happening in the industry broadly, it feels minutely right now. But one thing was clear yesterday, memory is suddenly having a moment. Both of you guys play and work your day-to-day lives in the memory space. Talk to me a little bit about what has fundamentally changed for you and your businesses over, I guess, the last short while.
Ace Stryker
>> You want to take that one first?
Betsy Chernoff
>> Sure. So I think what makes probably the most sense is really just talking about how we've evolved in AI today, right? And if you think about it from a level of where we started from even a year ago, people were just doing single-shot prompts. So they were going to these AI chatbots, typing in one thing and getting one response, and that was the way of the world. And for memory systems at that level, it's much easier to adjust and adapt to that. But as we've grown, you've seen things like multi-turn, concurrency, many users, many different rounds of conversations. And then in addition to that, the context lengths themselves have grown. So think about the Cursors of the world and Anthropic's Claude or Claude Code. All of these have exponentially increased the amount of memory required for these systems. And that has impacted, I think, both of us fundamentally at the core of what we do and how we build today.
Ace Stryker
>> Yeah. It's absolutely a torrid pace of development here. I heard somebody in one of these sessions recently say last week was a very busy year in AI, right? And that's kind of how it feels. It's just things are coming so quickly. From a storage perspective, the news of the quarter really seems to be around context memory, like Betsy's talking about. It sort of started in January at CES when Nvidia announced this thing called the Inference Context Memory Storage platform, which was a mouthful. Thankfully they've simplified that branding this week. I think they're calling it STX now. But essentially, what's-
Betsy Chernoff
>> One quick correction.
Ace Stryker
>> Yeah.
Betsy Chernoff
>> CMX is actually what they branded it.
Ace Stryker
>> CMX. Thank you.
Betsy Chernoff
>> So STX is a larger platform, but CMX is the context memory portion.
Ace Stryker
>> I'm still getting my acronyms.
Betsy Chernoff
>> There's a lot.
Gemma Allen
>> There's a lot of acronyms.
Betsy Chernoff
>> ABC acronyms doing right now.
Ace Stryker
>> But if you look at what jobs storage has traditionally had in an AI cluster, it's been in really two places. It's either been in the GPU servers to keep the GPUs humming along at high utilization with really hot data, or it's been over the network in shared storage, where density and efficiency is the name of the game. What's new now is the third job. It feels like storage kind of got a promotion this year, right? And that third job is new dedicated nodes specifically for storing context memory or KV cache. That's a completely new tier of storage in an AI cluster. And frankly, the market was already under siege and feeling intense demand before that announcement. That demand was not baked into anybody's forecast a year ago. And so here we are in a situation where now in order to scale and to continue to make AI models and outputs increasingly valuable for enterprises, we're going to have to come to terms with a whole lot more data, right? And be able to store that with a combination of world-class hardware and software.
Gemma Allen
>> Building on that and on some of the terms we hear. One term which I think we can all relatively understand is memory wall, right? We've heard Jensen say that a couple of times. It's clear that the future of AI factories involves solving for this. Talk to me a little bit about what exactly that means in practice and how it's also shaping the future of workloads in a world of context memory.
Ace Stryker
>> Sure. So if you think of how folks interact with an AI model in the most general sense, generally, take a large language model as an example. You put in a prompt, there's some input, there's some additional data that's attached to that. And all of that is visible to the model and considered as the model runs inference and generates the output, right? Now, if you ask for an iteration on that first prompt, it's got to remember both your second prompt and your first. Multiply that by 100 for these longer conversations, right? It's got to remember everything you've told it, all the files you showed it previously to continue to deliver more value. Now, factor in the fact that these context windows or the sizes of the individual prompts are exploding to like a million tokens in some cases. Multiply that by the fact that you may have thousands of concurrent users running inference on one system. Before long, you're into the petabytes or dozens of petabytes of context memory that these models need to be able to keep track of to continue delivering value, right? So it's very much not a trivial amount of data. And even the new Vera Rubin stuff that's coming online this year from NVIDIA, look at their NVL72 system. I think between GPU memory and CPU memory, you have like 70 terabytes, which is a lot, right? But when you're talking about context memory into the petabytes, that's where you hit the wall, right? And if you don't have someplace to put that data so that the model can access it later, you're going to have to throw it out and then you're going to have to pay the performance penalty of recomputing it on the fly, which is what this whole CMX approach is designed to sort of mitigate.
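To put rough numbers on the scale described above, here is a minimal sketch of the usual KV cache sizing arithmetic (one key vector and one value vector per layer, per KV head, per token). The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 10,000 concurrent contexts are illustrative assumptions, not figures from the interview. Only the million-token context and the roughly 70 TB of combined GPU and CPU memory come from the conversation.

```python
TB = 10**12  # decimal terabyte, in bytes


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies in a transformer.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every layer and every KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def total_kv_cache_bytes(tokens_per_context, concurrent_contexts, bytes_per_token):
    """Total KV cache if every concurrent context is fully populated."""
    return tokens_per_context * concurrent_contexts * bytes_per_token


# Hypothetical model shape (an assumption, not a specific product).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

# One million tokens per context, ten thousand concurrent contexts (assumed).
total = total_kv_cache_bytes(1_000_000, 10_000, per_token)

# Roughly 70 TB of combined GPU and CPU memory, as quoted in the interview.
in_system_memory = 70 * TB

print(f"per token: {per_token} bytes")
print(f"total: {total / TB:.1f} TB")
print(f"fits in system memory: {total <= in_system_memory}")
```

Under these assumptions a single million-token context needs about 328 GB, and ten thousand of them need about 3.3 PB, far beyond 70 TB. Real deployments will differ with model architecture, quantization, and how full each context actually is.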
Gemma Allen
>> And Betsy, in terms of some of the announcements this week, like STX, tell me a little bit about how that impacts you guys at WEKA. What does that mean for your product roadmap and your customer journey broadly?
Betsy Chernoff
>> So I think one of the amazing parts about all of this is last year at GTC, we announced augmented memory grid, and fundamentally that supports not only STX, but also CMX, to your point. Really the idea behind not only this memory wall, but just the area of the world we're in today is having an area where you can have persistent storage of this KV cache, right? To Ace's point, outside of sort of HBM or DRAM systems today in the GPU. And we've been building that way since 2025. Now, with the onset of STX and then also CMX, this gives us a perfect complement to what we've already been building and allows our customers to build and continue to build on NVIDIA with Vera Rubin and BlueField-4. I think these announcements just perfectly reiterate where not only the market is going, but also just the vision for this world, right? Traditionally, we talk about storage systems in a very specific way, but now there is an entirely new category of what we're talking about within storage systems. So it's equal parts very exciting for us and also it's something that we've been fundamentally building for a while.
Gemma Allen
>> And in a world where we have everything set up to have a storage solution that can be fed quite rapidly, right? Like it's an interactive process per se. What are customers really looking for? Is it reliability? Is it speed? Is it cost? What things are really featuring for you guys in your conversations on the ground?
Betsy Chernoff
>> One of the most interesting things and the way that it has been positioned through NVIDIA and the way that it makes the most sense, when you talk about the tiering of KV cache, it really flows not only through hot state, but also to cold state, right? And traditionally, it's had tiers in HBM and DRAM, like I talked about, and then sort of this cold storage at the end with SSDs. In the middle, there was nothing in between, right? So there was no having it ready for reuse and sending it back to hot state quickly. None of that existed before. So today, we now have that opportunity to fundamentally build in a meaningful way. And for us, that's with augmented memory grid and partnerships like Solidigm. But in general, this gives us an opportunity to really talk about where the world is going within AI and how that's being built today that's fundamentally different than it was before.
Gemma Allen
>> And Ace, the world of flash drives generally. What do you say to maybe folks that still think why not just use DRAM or use additional cloud storage? What are those competitive conversations like?
Ace Stryker
>> The answer in AI is always it depends, right? Which is why we have the product roadmap that we do with several different swim lanes. The job of NAND is absolutely to work hand in hand with DRAM to deliver great outcomes, right? There's no question about that. DRAM is always going to be an important part of the infrastructure. But I think what we're experiencing here recently with the swelling magnitude of AI inference data, and that's not just context memory, by the way, that's also RAG data and agentic AI generating sometimes many dozens of prompts based on the one human-entered prompt at the beginning of the process, right? We're quickly realizing that it's just not economically or technically feasible to keep everything in DRAM, right? And so you have to be smart about what stays there, what moves to the NAND that's in the GPU server, inside the box, what moves across the network, because there's a performance penalty associated with that a lot of the time, right? And what's evolving is a more sort of nuanced and refined tiering architecture, which great software like WEKA helps manage and essentially make that all painless to the user, right? But you've got to orchestrate where all this data lives, because it's just not possible to keep it in one place. And if you do move it outside of or off of your high bandwidth memory on the GPUs, you've got to make sure that it's accessible in such a way that your GPUs can stay busy, right? Nobody wants to drop tens of thousands of dollars or more on compute resources that then run at 50% utilization because they're starved for data, because the storage subsystem isn't keeping up, right? So it's very much a design question and it's about integration and partnership between hardware vendors like us who make the storage devices. And on a solution level, the folks at WEKA, the real geniuses who think about, "Okay, how are their customers and end users demanding data in what quantities? What do the transfers look like? Are they sequential or random? How much, et cetera?" All of that stuff needs to be very deeply thought through in order to have the simple and delightful experience of like getting a useful output from AI when you ask for it.
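The tiering idea in this answer (keep hot data in fast memory, move the rest to flash rather than throwing it away, and bring it back on reuse) can be illustrated with a toy two-tier cache. This is a sketch under stated assumptions, not how WEKA's augmented memory grid or any NVIDIA platform is implemented: the class name `TieredKVCache`, the least-recently-used demotion policy, and the `recompute` callback are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache.

    'hot' stands in for scarce HBM/DRAM; 'warm' stands in for a larger
    flash tier. Entries evicted from hot are demoted to warm instead of
    being discarded, so a later request avoids a recompute.
    """

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # least recently used entry first
        self.warm = {}
        self.recomputes = 0       # how often we paid the recompute penalty

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value  # demote rather than drop

    def get(self, key, recompute):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.warm:
            value = self.warm.pop(key)      # promote back to the hot tier
        else:
            self.recomputes += 1
            value = recompute(key)          # cache miss in every tier
        self.put(key, value)
        return value


cache = TieredKVCache(hot_capacity=2)
for session in ["a", "b", "c", "a"]:
    cache.get(session, recompute=lambda k: f"kv-for-{k}")

print(cache.recomputes)   # 3: the second request for "a" is served from warm
print(list(cache.hot))    # ['c', 'a']
```

With only the two-entry hot tier and no warm tier, the final request for session "a" would have been a fourth recompute. A production system would also have to account for transfer latency, bandwidth, and access patterns (sequential versus random), which this toy ignores.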
Gemma Allen
>> So staying on tiering architecture for a second, and I want to ask you something, Betsy, here. There is certainly a maturity element to this, right? There is a readiness element and a maturity element. And we can't say with any great conviction that all enterprises are ready for the inference era per se, right? There's an element of meeting companies, to Betsy's quote earlier, where they're at. What are you actually seeing in the way of requirements for storage at different layers of buyer journeys like from enterprise to sovereign cloud? What sorts of patterns are emerging for you guys? What are you seeing specifically, Betsy?
Betsy Chernoff
>> I think one of the most interesting things about this time and place is a number of customers today have infrastructure that isn't STX, right? They have infrastructure in place that they're building on already. And so for them, it really is, how do I build today so that I can build for the future or modernize to the future? The beautiful thing about how we're building is augmented memory grid and NeuralMesh work today on the infrastructure that our customer has, and that's because we are software defined. So not only can we help you with, whether it's Nvidia or x86s, whatever the case may be, but we can help you even evolve into that STX platform with Vera Rubin and BlueField-4, right? And it gives the freedom of choice for these customers. This moment in time, everything is evolving so rapidly. We've been talking about this sort of ad nauseam, but it really is evolving so quickly and customers are putting together what they can do today and also how they can build for tomorrow. So meeting them where they're at is fundamentally just the most important piece that I think we can do in order to help them not only build this in the way that they need to, but also support them for the future.
Gemma Allen
>> I'm interested in a comment you made, Ace, around idle GPUs, right? Because I'm sure it's a broad set problem, but there's also a narrative that there aren't enough GPUs. There will never be enough GPUs. So the idea that there are some GPUs sitting idle anywhere seems almost like ludicrous, right? But what are the actual realities? Like what are you seeing and what is the core clog that is allowing that to happen?
Ace Stryker
>> It's a good question. And there's been some interesting research on this topic. There are white papers available where I think folks have looked across big clusters at what is average GPU utilization in the real world. It depends, right? It's always the answer. But certainly, there are a lot of cases you could point to where banks and banks of expensive high-powered GPUs are running somewhere between like 50% and 80% utilization, which means there's a lot more value you could squeeze out of those. And in the sort of supply-constrained environment that we're in now, that's more important than ever, right? Why does that happen? It usually has to do with a bottleneck elsewhere in the system, and I think storage can often be the culprit there. One interesting thing that we looked at last year, there's a benchmark called MLPerf Storage. It's sort of an industry standard for measuring the performance of your storage subsystem in an AI cluster. And I went and I looked at all 130 or whatever submissions there were in the last round of that test and I graphed them all out. And what I observed was essentially like a straight line, one-to-one correlation between storage bandwidth and the ability to keep more GPUs busy, right? In other words, there's not really a way off of that line. You can't squeeze more out of your GPUs if they're being starved for data, right? And so making sure that the storage devices you have are performant, the software is modern and efficient and does the appropriate things in terms of tiering and putting the right data in the right place. All of that is part of answering the question, how do we get more out of those GPUs?
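The straight-line relationship described here can be written as a simple proportional model: if each GPU needs a fixed data rate to stay busy, the number of GPUs a storage system can feed scales linearly with its bandwidth. The sketch below assumes that simplification. The per-GPU demand of 2 GB/s and the cluster sizes are made-up illustrative values, not MLPerf Storage results, and the function names are hypothetical.

```python
def gpus_kept_busy(storage_gb_per_s, per_gpu_gb_per_s):
    """GPUs that a given storage bandwidth can feed at full rate."""
    return storage_gb_per_s / per_gpu_gb_per_s


def expected_utilization(gpu_count, storage_gb_per_s, per_gpu_gb_per_s):
    """Fraction of time GPUs are fed, capped at 1.0 when storage keeps up."""
    demand = gpu_count * per_gpu_gb_per_s
    return min(1.0, storage_gb_per_s / demand)


# Assumed workload: each GPU consumes 2 GB/s of data when fully busy.
PER_GPU = 2.0

# A 64-GPU cluster backed by 80 GB/s of storage is starved for data.
print(gpus_kept_busy(80.0, PER_GPU))            # 40.0
print(expected_utilization(64, 80.0, PER_GPU))  # 0.625

# Doubling storage bandwidth doubles the GPUs that can be kept busy.
print(gpus_kept_busy(160.0, PER_GPU))           # 80.0
print(expected_utilization(64, 160.0, PER_GPU)) # 1.0
```

In this toy model the starved cluster lands at 62.5% utilization, inside the 50% to 80% range mentioned in the interview. Real clusters have other bottlenecks (network, CPU preprocessing, checkpointing), so measured utilization will not follow the model exactly.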
Gemma Allen
>> Talk to me about the buyers and the shape of those buyer personas per se. Are you seeing a shift in terms of who you guys are selling into as this, I guess, industry both converges and collides in certain directions? How are things changing?
Betsy Chernoff
>> So one of the pieces that I've found that we've seen across the way is not only are people fundamentally or customers fundamentally impacted by throughput across the system, but in addition to that, they're all trying to figure out and solve for the same problem. We talked about it before with the memory wall. We just had at GTC this week, we announced not only are we partnering with NVIDIA around STX, but Firmus has done a POC with us, production grade for augmented memory grid. And within that, what we've seen for them is that they're getting upwards of 6x improvement in tokens per second, and that's huge, right? So when we talk about the system and we talk about how to impact it in a meaningful way, it really is that throughput number. And while I'm talking about an AI cloud, this doesn't just impact AI clouds. We know that. And we're seeing absolutely, to your point, a convergence between AI clouds, model builders, and then enterprises as they begin to adopt this as well. Where I think this will become incredibly interesting is particularly around what we call an AI factory and what that shape takes hold of in the future and what it is today. Predominantly today, it really just is around AI inferencing or AI workloads, but where that can go in the future, if you think about an enterprise, I think that spans much more than just one particular workload. It really does start to tap into enterprise workloads and that's where it gets interesting.
Gemma Allen
>> For sure. Well, one thing is clear this week. I think both of you and your companies are going to play a role in the future of AI factories. Talk to me, final question, what's ahead for you both here on the ground at San Jose for the next couple of days and what's ahead for the rest of 2026?
Betsy Chernoff
>> One of the most important things today I think we can talk about, and it can't be understated right now, we know that there's a NAND shortage, and we know that this is happening today. When we talk about numbers for token throughput, and we talk about things like customers never having to recompute another token unnecessarily, all of this impacts your ROI, but it impacts the systems that we have today and where customers can build, and that impacts power supply, and the NAND shortage, of course, because getting and sucking the most out of what you have today is so important. And that includes our partnership with Solidigm as well, because we can't do this without you guys.
Ace Stryker
>> Yeah. The challenge is real, right? And there's not necessarily a short-term solution to doubling NAND supply globally, right? It takes a while to spin up a new NAND fab. And you've seen announcements in the news, it's obviously being worked on, right? But in the meantime, it feels like the golden opportunity here is to figure out how to do more with the resources that we do have available, right? Whether that is the bit density in our factory, whether that is squeezing more cycles out of GPUs that are already installed, right? There's a lot of landed capacity in data centers that probably is underutilized, right? And I think there's opportunities through some of the innovations that we're talking about this week at GTC to realize new levels of efficiency to unlock scale until we can get to that place where NAND and DRAM and everything is abundant and easily obtained again. That's going to be a thing.
Gemma Allen
>> Well, guys, wonderful to talk to you both and to learn more about your companies. Thanks for having us here at this wonderful popup in San Jose.
Ace Stryker
>> Thank you.
Betsy Chernoff
>> Thank you.
Gemma Allen
>> I'm Gemma Allen here at the WEKA and Solidigm popup at Nvidia GTC 2026, live from San Jose. Thanks so much for watching.