In this interview from the NVIDIA GTC AI Conference and Expo, Andy Pernsteiner, field chief technology officer of VAST Data, joins theCUBE's John Furrier to discuss why storage has emerged as the critical bottleneck, and enabler, for AI factories operating at scale. Pernsteiner explains how VAST Data's multi-year partnership with NVIDIA, from the early GPUDirect Storage collaboration to the newly released Dynamo inference engine, is reshaping how GPUs consume and offload data. By moving previously computed KV cache attention data off expensive GPU memory and onto storage, VAST is delivering a 10X improvement in inference capability from a single GPU server, freeing cycles for active sessions and the rising tide of agentic workloads.
The conversation also explores how VAST Data is lowering the barrier to enterprise AI adoption through open-source foundation stacks built on NVIDIA blueprints, spanning document ingestion pipelines to video search and summarization. Pernsteiner highlights a recurring pattern among large enterprises: AI initiatives stalling in pilot phase because roll-your-own RAG systems cannot meet organizational security and multi-tenancy requirements at scale. VAST's integrated policy model addresses that gap, enabling regulated organizations to move from experimentation to production. The discussion also touches on the economics of intelligent tiering across GPU classes and the shift toward disaggregated infrastructure, where VAST's global namespace and data engine orchestrate compute jobs closest to the data. From tripling exabyte deployments year over year to preparing for a world where agents outnumber users by orders of magnitude, Pernsteiner provides a practical roadmap for building AI infrastructure that is fast, secure and ready to scale.
Andy Pernsteiner, VAST Data
Andy Pernsteiner of VAST Data speaks with theCUBE Research hosts at NVIDIA GTC '26 in the VAST Data VIP Lounge about the role of storage in artificial intelligence infrastructure. Pernsteiner outlines VAST Data's integrations with GPUDirect Storage, work on Dynamo and KV cache tuning for inference, open source foundation stacks and deployment playbooks and how these capabilities support training, multimodal pipelines and agentic workflows at scale. They emphasize storage as central to maximizing GPU utilization and minimizing idle time.
Pernsteiner highlights that offloading previously computed KV cache attention data from GPU memory to storage yields a roughly tenfold improvement in inference capability from a single GPU server. He describes how VAST Data's global namespace, data-engine orchestration and integrated policy model enable organizations to scale pilots into production while meeting security, cost and performance targets. The conversation addresses practical considerations for adoption, including deployment playbooks, open source integration and optimizations for inference and training workloads to improve scalability and system efficiency.
>> Hello, I'm John Furrier with theCUBE. We are here at GTC out in the VAST Data VIP Lounge, getting all the action on the ground here. CUBE's been here. The whole team's been here getting all the data. Big movement in terms of the accelerated computing, AI factories and the entire storage stack, the memory layer and the agent layer real time all happening here at GTC. And everyone's buzzed because the future is looking really bright and there's so much demand. Andy Pernsteiner is here, field CTO of VAST. Great to have you back on theCUBE. Thanks for coming on.
Andy Pernsteiner
>> Thanks for having me, John.
John Furrier
>> The demand curve is so hot right now that everyone's buzzing, their heads are spinning. But if you look at what was being said here at GTC, Jensen laid out a lot of stuff, but things he highlighted on was things can't be slow. A human can wait a second, but data, AI can't wait. It's got to be fast. This is the number one thing that's going on. You guys are doing a lot of work and the storage is playing the biggest role. In my opinion, it's the biggest thing that's changed from last year to this year functionally as it's becoming more important. I mean, memory is memory. We all know, we want more memory, but storage has changed a lot in its role and position for NVIDIA and overall ecosystem.
Andy Pernsteiner
>> Well, we actually feel like it's been that way all along. It just takes a little while for it to become popularized and people to notice it. The clients that we've been working with over the last several years who've been training AI models and building pipelines, data's always been at the center of what they needed. I think the mainstream is catching up with it now. Jensen announcing these kinds of things on stage obviously lends a lot of credibility to that. Part of it also is because we're seeing it tripling in terms of the number of exabytes that we're going to be deploying year over year compared to last year, and that's pretty significant. And I think it's a combination of a lot of factors. The type of data that people are analyzing now to build models is changing from text to multimodal and beyond. And I think a lot more general purpose mainstream organizations are realizing that they need to harvest the entirety of their data to get their models right.
John Furrier
>> I think it's fair to point out, and when I mentioned big change, I think it is a mainstream awakening. But we covered VAST's launch a few years ago when you guys launched the company. And I think the AI operating system wasn't the word, I think it was data platform operating system, but basically it was vectoring into AI. So to give you guys credit, you have been doing a lot of big successful deals with the big AI infrastructure builders. So you're kind of in the flow on the trajectory. So what, in your opinion, should people know about GTC when it relates to storage? What's actually happening? What is NVIDIA saying when they say this is the role of storage?
Andy Pernsteiner
>> Well, I think NVIDIA's primary goal is to accelerate innovation and to make sure the GPUs are being consumed as much as possible. And if you have an idle GPU because the network is slow, well, that's a problem. Well, NVIDIA, as you know, provides large scale data center networking. So they have control over that situation. They don't have their own implementation of a storage platform and they needed to make sure that whatever they chose to use and whatever they worked with their customers to use would be able to satisfy the demands of GPUs. So the first layer is, obviously, you need to provide the bandwidth. Early on, I think about five or six years ago, we partnered with NVIDIA to work on Project Magnum or what's known as GPUDirect Storage, because our platform natively supplied NFS over RDMA as an option and gave very low latency access for GPUs to pull data into memory. And that's been sort of the underpinning of a lot of the technologies that have come after. More recently, we're working with them on their Dynamo inference engine, specifically around tuning of KV cache as it relates to offloading attention data from GPU compute into a cache that can grow exponentially and allow for inference, not just for users who are using chatbots, but also for agents who are constantly throwing inference requests at GPU servers. It allows the GPUs to be utilized better for what they're good for and allows the offload of those expensive GPU cycles onto storage.
John Furrier
>> Yeah. NVIDIA announced the availability of Dynamo 1.0. You mentioned that. And talking to Charlie Boyle earlier today, he said, "We don't want to do storage." So KV cache, Dynamo is essentially the orchestrator. It is the operating system, their words, whether it's KV cache or Dynamo, it's all kind of the same thing. Why is that so important? Why is Dynamo an important piece of the puzzle?
Andy Pernsteiner
>> Well, I think what people have found is that initially, when people were talking about accelerating inference, they would focus on what you could do within high bandwidth memory, because obviously, that's the lowest latency. But I think the word latency has a different meaning in the world of LLMs and AI. In the world of storage, you think about latency in microseconds or milliseconds. And you think about milliseconds when it comes to LLM requests, but usually it's a number of milliseconds. It's not one millisecond. And so what we found is that if you can offload previously computed attention data from an LLM session, if you put it onto storage, yes, it's slower to initially fetch it, but what it means is that the GPU can be used for more active sessions. And so if a GPU isn't busy having to recalculate previously computed session data, then it can easily service another request and asynchronously fetch that session data. And that's the work that we've been working with NVIDIA on. And we see a 10X improvement in inference capability out of a single GPU server.
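[Editor's note: the trade-off Pernsteiner describes, fetching previously computed attention data from storage instead of recomputing it on the GPU, can be illustrated with a toy model. The sketch below is not Dynamo or VAST code. The class name, the LRU eviction policy and every cost constant are assumptions chosen only to show the shape of the idea.]

```python
from collections import OrderedDict

# All costs are made-up units for illustration, not measurements.
RECOMPUTE_COST = 100  # GPU work to rebuild a session's KV cache from scratch
FETCH_COST = 5        # GPU-side work to load a saved KV cache from storage
HIT_COST = 0          # KV cache is already resident in GPU memory


class KVCacheTiers:
    """Toy two-tier KV cache: a small GPU tier (LRU) and an optional storage tier."""

    def __init__(self, gpu_slots, use_storage):
        self.gpu = OrderedDict()   # session_id -> True, kept in LRU order
        self.storage = set()       # sessions whose KV cache was offloaded
        self.gpu_slots = gpu_slots
        self.use_storage = use_storage
        self.gpu_cost = 0

    def request(self, session_id):
        if session_id in self.gpu:
            # Session state is still in GPU memory: nothing to rebuild.
            self.gpu.move_to_end(session_id)
            self.gpu_cost += HIT_COST
        elif session_id in self.storage:
            # Session state was offloaded earlier: fetch it instead of recomputing.
            self.gpu_cost += FETCH_COST
            self._admit(session_id)
        else:
            # No saved state anywhere: the GPU must recompute it.
            self.gpu_cost += RECOMPUTE_COST
            self._admit(session_id)

    def _admit(self, session_id):
        self.gpu[session_id] = True
        if len(self.gpu) > self.gpu_slots:
            evicted, _ = self.gpu.popitem(last=False)  # drop least recently used
            if self.use_storage:
                self.storage.add(evicted)              # offload instead of discarding


def total_gpu_cost(requests, gpu_slots, use_storage):
    """Replay a list of session ids and return the accumulated GPU cost."""
    tiers = KVCacheTiers(gpu_slots, use_storage)
    for session_id in requests:
        tiers.request(session_id)
    return tiers.gpu_cost


if __name__ == "__main__":
    workload = [0, 1, 2, 3] * 3  # four sessions taking turns, two GPU slots
    print("without storage tier:", total_gpu_cost(workload, 2, False))
    print("with storage tier:   ", total_gpu_cost(workload, 2, True))
```

[With these invented constants the run without a storage tier recomputes every request, while the run with one pays the recompute cost only on each session's first turn. The size of the gap is an artifact of the chosen numbers and says nothing about real hardware.]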
John Furrier
>> The bottom line is more tokens.
Andy Pernsteiner
>> Basically.
John Furrier
>> I mean, more tokens. Talk about foundation stacks. Open source has been a huge theme for this event. I mean, Jensen couldn't have been screaming it louder on the stage. He talked about OpenClaw. He continued to reiterate the vertical he's going after. He talked about horizontal scale, vertical specialization, domain expertise. Open source fills in the gap there. Talk about foundation stacks and how that relates to the whole blueprint piece and open source.
Andy Pernsteiner
>> So we actually started this project off based on using NVIDIA's blueprints and allowing them to be deployed on our platform. Some of the early initial implementations were around NVIDIA's nv-ingest pipeline for generating embeddings based on documents. We then extended what we worked on to include video in terms of video search and summarization. NVIDIA has a great model that can do video analysis and summarization. And so what we've done is we've created open source GitHub repositories for these pipelines that customers can pull down, manipulate to the way they want, and then deploy them. They're easiest to deploy on VAST, because of course, that's where we did all of our effort, but we're leaving it as open source because what we understand is that not one size fits all. And so customers need to be able to make adjustments and modifications, but they also need a recipe so that they don't have to learn how to build it from scratch. And so we're giving them not only the recipe, but also the playbook on how to deploy it. And then if they need to make customizations, then they can.
John Furrier
>> So the CUDA stack and the AI stack has been a big advantage for NVIDIA. You guys have taken a very, I won't say storage, data platform approach. We've talked about this on theCUBE many times. How is that going with customers now? As they look for inference, you guys ran the table on training, that's kind of well known. Now they've got inference booming. What's your vision on how you see that unfolding?
Andy Pernsteiner
>> Well, I think what we see, at least in large enterprise organizations, maybe we'll call them more conservative organizations, if you want to say that, they all have AI initiatives. What we've found is that those AI initiatives have a hard time proliferating throughout the entire organization because they get stuck in a pilot phase because they have a challenge scaling them and they have a challenge making sure that they fit the security model of the organization. If you're a bank and you have different trading units within your firm, they don't share data between each other and you need to make sure that data as well as regulated data is protected using the bank's security policies. But roll-your-own implementations of a RAG system don't give you that without having to have specialized staff and consulting. And so what we've done is we've created a policy model and integrated it with the way that NVIDIA deploys pipelines to ensure end to end security. And so organizations can distribute that across a wider array of their user base and be able to bring things into production. We've talked to many customers this week where they've been in the pilot phase and what's stopping them is this security and the ability to scale. And so we're giving them a path to do that.
John Furrier
>> Yeah. And I love the tiering angle. It's not new to you guys. I mean, but NVIDIA for the first time I see the Pareto charts, you see the segmentation, it's basically levels and you start to see, okay, Vera Rubin is going to have a great performance curve at certain levels. Certainly, you got to pay for that, but then you still have performant other areas. It's tiering. And so you got to have intelligence around data and what workloads to the point you were saying about knowing what not to send where, why I'm going to do that. I mean, why would I want to have all the expensive GPUs, either idle or doing tasks I don't want to do. So you start to see that coming together. As a techie, what's your take on this? Because this to me is massive progress.
Andy Pernsteiner
>> Yeah. I mean, what I see is that being able to intelligently route things to the right place, it comes down to cost and economics, whether it's at the storage layer or whether it's even within the tiers of GPUs. And especially as agentic workflows become more and more mainstream, and especially if the number of agents increases, you need automated ways of making sure that things are running in the right place and have the optimal experience. And so agent to agent communication is an example of an area of development. Another one is ensuring that agents have access to the correct data and the right context as though they were a person, but people are in some ways a much more finite resource compared to the number of agents that we're going to see proliferating. So I think when people think about agents, they might think about very simple use cases, but just imagine multiplying that by tenfold or a hundred fold or a thousand fold, that's the world that we're going to live in.
John Furrier
>> All right. So I'm going to pretend I'm a prospect or customer or just a friend. Explain to me why VAST Data has got the secret sauce to feed the beast, the GPUs, the systems? What's the secret thing?
Andy Pernsteiner
>> Our secret isn't so secret, right? We wanted to take commodity hardware that other manufacturers created and be able to expose as much performance as possible to any application, regardless of whether the data was new or old. We completely tore down the storage stack that people have considered to be the sort of tiered storage architecture of the past. We've completely flipped that upside down. And so now, we've basically made the standard way that people deploy AI factories and AI model training solutions. It needs to be an all flash, all performance system. We were the first to bring it to market because we were the only ones who knew how to commercialize it in an economical way. In terms of performance, our goal is to always be at the level where NVIDIA is ready, right? As they increase the networking speeds, our hardware vendors support those networking speeds. We partner very closely with them for switching, for network interface cards, for everything to give the best possible chance for those GPUs to be busy. And ultimately, we actually don't see the bottleneck as storage or networking or even GPUs. We see it as humans' ability to use those resources. And so another part of the VAST story is to make things easy to use, because then more people will use them.
John Furrier
>> Yeah. You guys got a good story. Final question for you. We talked about this on the panel with Solidigm. First of all, they're doing great with the density. They solve a lot of that problem. You guys bet on the density equation, but the disaggregated infrastructure is a conversation that's in the hallways right now. You're seeing a lot of people talking about, as you disaggregate the centers, the data centers or the factories, we'll see a lot more distributed computing paradigms in the AI infrastructure. Just your thoughts on how that progresses, how that's evolving, how you see it, where it's going.
Andy Pernsteiner
>> I think that where data is generated, where it's processed and trained and where it's inferred, that's going to be a fluid situation. And I think customers don't want to have to constrain what they do based on either where the data is generated, where the compute is, or sort of where the users are. And so our goal is to make it as flexible as possible, but then also as efficient as possible. If you generate data in a factory and you have an AI model that needs to train against that data to build some kind of robotics model or quality control model, you don't necessarily want to deploy all your GPUs to your factory. That's not what it's good for. Maybe the power is too expensive there, but you also don't want to make an extra copy of all of your data to push to where your GPUs are. And so we've created a global namespace to help handle that. We also have what we call the data engine, which is an orchestration layer for compute, which is also data location aware so that compute jobs can run where data access has the lowest latency and be able to make sure that the customer is getting the most value out of every part of their asset.
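[Editor's note: the idea of data-location-aware placement, running a compute job at whichever site can reach the data with the lowest latency, can be sketched in a few lines. This is a hypothetical illustration and not the VAST data engine API. The `Site` type, the `place_job` function, the site names and the latency figures are all invented for the example.]

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_gpus: int
    latency_ms: dict  # dataset name -> read latency from this site (invented values)


def place_job(sites, dataset, gpus_needed):
    """Return the name of the site with enough free GPUs and the lowest
    read latency to the dataset, or None if no site qualifies."""
    candidates = [
        site for site in sites
        if site.free_gpus >= gpus_needed and dataset in site.latency_ms
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda site: site.latency_ms[dataset])
    return best.name


if __name__ == "__main__":
    sites = [
        Site("factory", free_gpus=0, latency_ms={"line_cam": 1.0}),
        Site("dc_east", free_gpus=8, latency_ms={"line_cam": 12.0}),
        Site("dc_west", free_gpus=8, latency_ms={"line_cam": 40.0}),
    ]
    # The factory is closest to the data but has no GPUs, so the job lands
    # on the nearest site that does.
    print(place_job(sites, "line_cam", gpus_needed=4))
```

[A real orchestrator would also weigh power cost, queue depth and data movement, which this sketch leaves out on purpose.]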
John Furrier
>> So you're ready? You're ready for it?
Andy Pernsteiner
>> We're ready. We're ready for the next thing.
John Furrier
>> Andy, good to see you. Thanks for coming on theCUBE here. Special edition in the heat here in the VIP Lounge. VAST Data, again, multiple years in a row, great customer success we've been seeing it, been packed here. Thanks for having us.
Andy Pernsteiner
>> Oh, thank you so much.
John Furrier
>> All right. I'm John Furrier, host of theCUBE. We are here in the VIP Lounge at VAST Data. Thanks for watching.