This session at the WEKA Pop-Up during NVIDIA GTC 2026 examines memory and inference trade-offs and how storage-driven architectures extend GPU context for inference-era artificial intelligence (AI). The discussion focuses on KV cache pressure, agent-driven token demand, storage-backed context memory, and heterogeneous LPU and GPU fabrics, with emphasis on practical performance and efficiency for AI infrastructure.
Val Bercovici of WEKA, chief AI officer, and Daniel Kearney of Firmus Technologies, chief technology officer, join theCUBE Research hosts. The conversation, moderated by Gemma Allen of theCUBE, explores how memory shifts from a background cost center to a central enabler for agent workloads and evaluates storage-based extended memory approaches for inference.
Key takeaways include Bercovici reporting a 6.5x increase in effective token capacity from the WEKA and Firmus proof of concept. They argue that storage economics extend memory by orders of magnitude as agent demand grows and that storage-driven architectures provide a cost-effective path to extended GPU context for inference workloads.
Kearney emphasizes operational efficiency, advocating prefilling once instead of repeated prefill to reduce wasted GPU cycles and energy. They note that this approach reshapes buyer journeys and influences sovereign and hyperscale deployments.
Watch for detailed discussion of KV caching strategies, storage-backed context memory and joint proof of concept results that inform deployment decisions for GPU inference and AI infrastructure.
Val Bercovici, WEKA & Daniel Kearney, Firmus
In this interview from the Nvidia GTC AI Conference and Expo, Val Bercovici, chief AI officer of WEKA, joins Daniel Kearney, chief technology officer of Firmus, to talk with theCUBE + NYSE Wired's Gemma Allen about how memory is emerging as the critical bottleneck — and competitive advantage — in the shift from chatbots to agentic AI. Bercovici explains how insatiable agent-driven token demand has transformed KV cache from a background concern into a keynote-level topic in just twelve months. A joint proof of concept between WEKA and Firmus demonstrated that arbitraging storage for memory can yield 6.5 times more tokens from the same GPUs and energy budget — the equivalent of creating five and a half new data centers out of thin air.
The conversation also explores how the rise of LPUs alongside GPUs is creating a heterogeneous inference architecture where prefill, context memory and decode each occupy distinct layers of the stack. Kearney details Firmus's "model to grid" philosophy, designing AI factories where every watt counts — from accelerated compute and thermal management to grid integration. He highlights the company's expansion into Australia through Project Southgate, which will deploy up to 2.7 gigawatts of AI capacity over the next two to three years alongside sovereign cloud deployments in Singapore. From engineering out hardware obsolescence to enabling sovereign nations to maximize token output within finite energy budgets, both leaders outline why efficiency-first infrastructure will separate the winners from the losers in the agentic era.
>> Welcome back to theCUBE here on the ground in San Jose, it's NVIDIA GTC 2026, so much happening. I am here at the WEKA Pop-Up where we're talking all things memory and just what is happening in this inference era of AI. Joining me now is Val Bercovici, Chief AI Officer at WEKA; and Daniel Kearney, CTO at Firmus. Welcome, guys.
Val Bercovici
>> Thank you.
Daniel Kearney
>> Thank you very much.
Gemma Allen
>> And happy St. Patrick's Day.
Val Bercovici
>> Yes, absolutely.
Gemma Allen
>> A lot has changed in a year as it relates to memory as a topic, right? A year ago, it was a cost center topic, not something that people have thought a whole lot about. You guys have had an interesting year in terms of your value in the industry. Let's maybe start there. What made this a keynote topic at GTC this year?
Val Bercovici
>> Yeah, and these are really interesting milestones, GTC. GTC last year, we first started talking about the pressure on AI memory in the form of this thing called KV cache. And it was really a very lonely conversation last year because candidly, the predominant use case for this was agents, and March 2025 had no real agent activity in the industry. Fast-forward to May, Claude Code started to appear on the radar and gained viral adoption. By December of last year, all of a sudden agents were mainstream. As Jensen was saying, we're now seeing insatiable, uncapped demand for tokens now driven by agents. So today, March 2026 is a fundamentally different, 180-degree different conversation where there's huge demand. It's almost a very specific kind of demand, coding agent demand, placing acute pressure on this thing called KV cache, which is very, very memory-centric. And what's interesting, and we'll dive into it, is the way you can extend memory by a factor of a thousand X by using storage economics and storage cost of goods to deliver that memory capability that the market really demands right now.
Gemma Allen
>> Daniel, level set for us first, Firmus as a company, I know we're going to talk about the partnership and some very interesting developments and results, but tell our listeners and viewers, what exactly does Firmus do?
Daniel Kearney
>> Thanks for having me. Firmus designs, builds, deploys, and operates AI factories. One of the peculiar things about our company and our mission is that it's really about creating the most efficient AI infrastructure. And our strategy really focuses from model to grid. We look at every single possible touchpoint where energy is consumed, right through from the compute itself, the thermal management, the power and the integration with the grid. And of course, part of that is actually our infrastructure, our accelerated compute layers. We really want to make sure every watt counts in this ecosystem, and I think that's what we saw this year at GTC even more than ever. As scale grows and energy demand for these systems is even bigger, we want to make sure that we're driving the best efficiency possible. That's why this discussion is really apt, because there are layers of this at every point where we're asking, can we do better with algorithms? Can we do better with systems? And again, against this moving target, which is the workloads. And now we've just seen this demand for more and more context to stay in the system for longer and longer.
Gemma Allen
>> On the topic of efficiency, you guys just had a very interesting proof of concept together, your two entities. Talk a little bit about what this showed from the perspective of results and what it means or what it could signal to the market.
Val Bercovici
>> Yeah, so the promise of this technology sounds almost too good to be true. It's arbitrage storage for memory and get all these win-win benefits. We struggled to actually prove this out. We have this, again, natural partnership, efficiency-driven partnership. We were able to do it courtesy of the graciousness of having access. You need significant resources, significant CPUs and GPUs to do this. We were able to get two racks fundamentally of GPUs from Firmus, prove this out, and the results were what we expected, which was you're able to get out of the same CapEx and OpEx, the same GPUs and energy costs, 6.5 times as many, so 550% more tokens. So it's as if in a macro scenario, you just created five and a half new data centers out of thin air to serve agents, to serve the tokens for agents.
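The three figures quoted here (6.5 times, 550% more, five and a half new data centers) are the same number expressed three ways. A minimal sketch of that arithmetic follows; the function name `token_uplift` and its dictionary keys are hypothetical, chosen only for illustration, and the only input taken from the interview is the 6.5 multiplier.

```python
def token_uplift(multiplier: float) -> dict:
    """Restate a 'times as many tokens' multiplier in two equivalent forms.

    multiplier: tokens produced with extended memory divided by tokens
    produced without it, on the same GPUs and energy budget.
    """
    extra = multiplier - 1.0  # additional output relative to the baseline
    return {
        "percent_more": extra * 100.0,         # "550% more tokens"
        "equivalent_new_data_centers": extra,  # extra baseline-sized sites
    }


result = token_uplift(6.5)
print(result["percent_more"])                 # 550.0
print(result["equivalent_new_data_centers"])  # 5.5
```

The baseline data center counts as one, so a 6.5x multiplier corresponds to 5.5 additional data centers' worth of output rather than 6.5.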
Gemma Allen
>> We hear a lot about time to token, right? You're talking about usage and I guess leverage per token. What is really driving customer demand? What is the core number one priority that customers are really, really needing?
Daniel Kearney
>> It's pretty multifaceted, at least from our perspective. I think customers, obviously, are looking for a high amount of throughput and the lowest latency in terms of tokens per user per second. And we see that graph being used quite a lot and how it's moving to the right of the curve, right? And the more and more throughput we put through models, the more intelligence we get back, but we also want to retain some of the memory and not forget in models. That's where this extended memory comes in. The workload capability, the ability for agents to discuss with other agents and then communicate and hold context, is just going to become more and more valuable as a macro trend. That's at least how we're seeing it, and we're seeing customers coming with more and more requests for not just GPUs that are super powerful, but the full ecosystem. We see AI as a full accelerated compute fabric. There are many different types of systems in there. We see more CPU now today. We see the launch of LPUs, and part of that extended fabric is going to be a lot more ability to hold more prefill, more memory, and extend that GPU from not just a supercomputer, but also to a memory-based system.
Gemma Allen
>> Because you raised LPUs, I feel it's interesting to ask, okay? We saw some of the Groq 3 announcements by Jensen yesterday. What are your thoughts from the perspective of competitive positioning, what that means for your businesses in the way of, I guess, providing solutions for this LPU era that now suddenly you're also in, right? There are so many things happening at once.
Val Bercovici
>> This is a very fun and very geeky technical topic. I think that the macro backdrop here, as Daniel was saying and Jensen showed on stage yesterday or two days ago, was that the consumption went from chat about a year ago, at GTC last year, towards agents now. And so to your earlier question, chat is very time-to-first-token sensitive, especially for voice agents. It's very awkward when we have that awkward pause. It was even mocked in Super Bowl commercials this year, right? So that's where the LPU technology is really ideal, because I prefer the term fixed latency even more than low latency. These LPUs, Groq, Cerebras, and others, even TPUs to a certain extent, are really, really good at these fixed 100-millisecond-and-below latency interactions, but that's a certain model size, not the largest models, not agents. And it's a certain query size, certain prompt size, typically 8K or below, not 100K or now the million-token prompts and context windows we see. That creates a spectrum of solutions and integrations required. So alongside the LPUs, we still have to have the GPUs, ostensibly for prefill, and we have to have this critical element in the middle, this context memory, so that you're not having this excessive amount of traffic over these expensive high-performance networks between GPUs and LPUs. We can prefill on the GPU, store it at memory speeds on shared context memory storage, and then be able to decode that on LPUs. So it's an elaborate architecture that's forming here, but it's filling these needs of these segments that are emerging as the market matures.
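The three-layer flow described here (prefill on GPU, park the result in shared context memory, decode on LPU) can be sketched as a toy pipeline. This is an illustrative sketch only, not WEKA's, Firmus's, or NVIDIA's implementation: every name in it (`ContextStore`, `prefill_on_gpu`, `decode_on_lpu`, `serve`) is hypothetical, the "KV state" is a stand-in list rather than real per-layer tensors, and it keys the cache on the whole prompt, whereas production systems typically cache at the level of prompt prefixes or blocks.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Stand-in for shared context memory: prompt key -> stored KV state."""
    blobs: dict = field(default_factory=dict)

    def put(self, key: str, kv: list) -> None:
        self.blobs[key] = kv

    def get(self, key: str):
        return self.blobs.get(key)


def prompt_key(prompt: str) -> str:
    # Hash the full prompt; an assumption made to keep the sketch small.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def prefill_on_gpu(prompt: str) -> list:
    # Placeholder for the compute-intensive pass over the prompt.
    return [len(word) for word in prompt.split()]


def decode_on_lpu(kv: list, max_new_tokens: int) -> list:
    # Placeholder decode loop that only consumes the stored KV state.
    return [f"tok{i}" for i in range(max_new_tokens)]


def serve(prompt: str, store: ContextStore, stats: dict,
          max_new_tokens: int = 3) -> list:
    """Prefill only on a cache miss, then decode from the stored state."""
    key = prompt_key(prompt)
    kv = store.get(key)
    if kv is None:
        stats["prefills"] += 1
        kv = prefill_on_gpu(prompt)
        store.put(key, kv)
    return decode_on_lpu(kv, max_new_tokens)


store = ContextStore()
stats = {"prefills": 0}
serve("summarize the design doc", store, stats)
serve("summarize the design doc", store, stats)
print(stats["prefills"])  # 1
```

The second request finds its KV state in the store, so the GPU prefill step is skipped and only the decode step runs.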
Gemma Allen
>> And in a world where memory suddenly becomes the prominent feature, right? As opposed to just something that happened in the background that you didn't really think a whole lot about, now it's an enabler, it's a competitive advantage. How does that change the buyer journey for both of your companies? Does this become an integrated part of a package for you, for example, Daniel? Does it become something that you suddenly market in a different way? Talk to me a little bit about how the conversation meets the market.
Daniel Kearney
>> Yeah, so when we interact with customers, obviously, from an internal perspective, we're looking at making every watt count. But of course, customers want performance, they want optionality, they don't know their future in every way, right? Any customer that said they knew that huge amount of memory in March last year, and then suddenly the rise of agents came the way it did. Obviously, they can't predict that. And there will be more of those moments probably in the future. So this ability to bring in more specific silicon or more extended context to fit workloads, even retrospectively, to continue to extend the usefulness of a CapEx investment that happened prior, that's huge. That allows a physical flexibility that didn't normally exist in the past. For us, with this type of capability and working with the WEKA team, we can engineer out obsolescence. We can now bring an existing GPU system or GPU-based system to market that's ready for the next generation of workloads without having to redeploy and throw out the old to bring in a whole new system. That's not necessary. We still keep relevant for customers, and they will evolve themselves. Their workloads change. It's not always easy for them to predict how their customers will consume their product, and we can stay relevant and future-proof.
Gemma Allen
>> It's about the experience.
Daniel Kearney
>> Exactly.
Gemma Allen
>> But for you, Val, it's a moment of reckoning, right? It's a moment of reckoning in this industry because you have been talking about something that was a little bit backstage, if not very backstage for a very long time. So how are you thinking about it? It's a great time to be a marketeer at WEKA. How do you think about the message to the market in this moment?
Val Bercovici
>> It's very fulfilling, obviously, to be proven right, but that's not what's valuable to us right now. What's valuable to us right now is offering integration points with great partners like Firmus so that your tenants can deploy this today if they want for maximum leverage. And the reality is, again, if we go back to Jensen's keynote, insatiable demand for tokens driven by agents, including NemoClaw, OpenClaw right now, which means there's an opportunity to capture that token revenue right now as we speak today, and there's an opportunity cost to wait. And what we're going to find by the time we come back next year and talk is who are going to be the new giants, the new kingmakers that didn't wait, that seized the day, carpe tokenum or whatever, carpe diem, right? Seized the moment, implemented this today as a tenant on Firmus and were able to just use the leverage of augmented memory to capture not only 6.5 times the number of revenue users as tenants of Firmus, but for agents. We now measure agent work time in hours and days, and that compound benefit of having faster and faster agent turns spread out over tens of thousands of agent turns means drugs are being discovered faster. Cures are being discovered faster. Trades are being optimized better. There are so many use cases right now where there's massive business value. The simplest case is better PowerPoints faster, right? All sorts of valuable things that, again, are available today, and the winners and losers will be determined by who seizes the moment right now.
Gemma Allen
>> And Daniel, speaking of use cases and the global remit of Firmus and also this opportunity, right? It's a very global opportunity. You're interesting from the position that you guys have very solid footing in APAC, but a global model.
Daniel Kearney
>> Yes.
Gemma Allen
>> Talk a little bit about what sorts of customers you're seeing, what sort of patterns are emerging that perhaps didn't exist a year ago.
Daniel Kearney
>> Yeah. So we serve customers in Singapore, a mixture of hyperscaler and government customers, and some of them are actually foundational labs that are building out national AI models. So that's a very exciting place to be. They're very much at the cusp of what can be captured in Singapore. We're also growing in Australia with our Project Southgate. We're going to deploy up to 2.7 gigawatts of AI capacity over the next two to three years, which is extremely exciting. And we're going to be exposed to a huge amount of change in the workload space with those customers coming on because they're taking huge tenancies and they're going to be quite active in model building as well. We're very much going to be exposed to the world's leading model builders and leading model providers from APAC, and we're going to build out as a main sovereign cloud, but also a global support vehicle for that. We're excited about what's coming and having optionality in the way we build these clusters, and building these clusters that are future-proofed for a world that we are finding it hard to predict right now, at least in the AI space, that's really good. And remember, we're ruthlessly focused on efficiency, and this is another way of, obviously, the energy piece, but also how tokens are used. Carrying tokens, putting them to work where they're needed rather than cycling through unnecessary algorithm loops, this is also important. These are just other features that agents will be doing, carrying context with them, bringing it into the next compute rather than having to re-compute all the time. So it's one thing about tokenization, use of tokens, but using them efficiently and being able to make sure that whatever is produced, whatever watt is converted into intelligence, is being used the way intelligence should be and it's cascading forward.
We're always thinking about that from that model-to-grid philosophy of the company that I told you about. Every efficiency point counts because it's on us as well to try to drive that as a company.
Gemma Allen
>> Let's talk for a second about the cost of potential wastage, right? Especially in this inference era we're entering into. There is a belief that you just need as many GPUs as possible, right? But the reality is, if you have any GPU sitting idle, it's a huge cost drain to your business. What are you guys actually seeing from the perspective of customers that you talk to day in, day out? How prominent is this challenge and how much control do they really have over a solution in such a complex time of scaling outward, right?
Val Bercovici
>> Yeah. Today, it's very prominent. This concept of efficiency, again, is why we're natural partners, and sovereign clouds play into this really, really well. If you look at what was emphasized on the keynote stage again, we're at the inflection point of inference, and inference is conceptually these two phases: the compute-intensive prefill and the memory-intensive decode. Today, with GPU efficiency, you can have a very busy GPU that's prefilling redundantly over and over and over again because it has a very limited memory window. So as soon as some new prompts come in, old prompts have to be evicted, and then you're inefficiently refilling, re-prefilling on the GPU side. So does the GPU look busy? Yes, but engineering has a term for this. There's throughput and there's goodput. And goodput is efficient, useful work that the GPU is doing. So in our case, conceptually for a long-running agent, instead of prefilling about a million times across 10,000 turns across a day or two, you prefill once. It's called an order-of-one operation instead of order of N. And that's a magical optimization for engineers that this technology enables. And then you spend all of your time and resources and effort on rapidly decoding, which makes your agents more reactive and responsive, shortens those day-long agent runs into hour-long agent runs, and gets us to our results faster. And particularly for sovereign AI, I love the bookends you have right now of Singapore and Australia. In Singapore, energy is a very, very finite thing. Real estate, you could argue, is fine, but energy is super finite. The ability to generate 550% more tokens from that same energy budget is a game changer for sovereign AI.
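The throughput-versus-goodput distinction can be made concrete with a small accounting model. The sketch below is an assumption-laden illustration, not a measurement: it models the worst case in which, without persistent context memory, the agent's whole context is evicted and recomputed on every turn, and the function names and the turn and token counts are hypothetical.

```python
def prefill_tokens(turns: int, tokens_per_turn: int, cached: bool) -> int:
    """Total tokens pushed through prefill over a multi-turn agent run."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn
        if cached:
            # KV state for earlier turns is retained: prefill only new tokens.
            total += tokens_per_turn
        else:
            # KV state was evicted: recompute the entire context this turn.
            total += context
    return total


def goodput_fraction(turns: int, tokens_per_turn: int, cached: bool) -> float:
    """Share of prefill work spent on tokens that were actually new."""
    useful = turns * tokens_per_turn
    return useful / prefill_tokens(turns, tokens_per_turn, cached)


print(prefill_tokens(4, 100, cached=False))    # 1000
print(prefill_tokens(4, 100, cached=True))     # 400
print(goodput_fraction(4, 100, cached=False))  # 0.4
print(goodput_fraction(4, 100, cached=True))   # 1.0
```

In this model the GPU is equally busy in both cases, but without caching the redundant work grows with the square of the number of turns, so the goodput fraction falls as the agent run gets longer.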
Gemma Allen
>> It's next Monday, we are all home from GTC, we're recovered, and you have some tenants that think, okay, wow, I would like to leverage that opportunity, right? I want those cost advantages. How will this partnership play out in the market?
Daniel Kearney
>> Yeah. I mean, we're going to be giving customers as much choice as possible. And we've had a pre-established relationship with WEKA for a number of years now, hence this kind of POC that we're working on. What I love about WEKA as well is the engineering engagement. They're very much a technical entity just like ourselves, so we're out to solve problems and bring value to customers, so I think that will continue into the future. And like I said, efficient options for customers and making sure they match what the customers actually need, fit for purpose. We're always going to continue to push that wherever we can and make sure that if the local entities in Singapore need to push limits, then we have an option for them right there and then, and of course onto Australia as well.
Gemma Allen
>> Well, nerds love nerds, right?
Daniel Kearney
>> Yes.
Gemma Allen
>> Okay.
Daniel Kearney
>> Gravity.
Gemma Allen
>> Okay. Lastly, and I'm going to give you a question first of all. So based on your predictions last year, if we have any Polymarket or Kalshi enthusiasts listening or watching this, what are your predictions for GTC next year? What do you expect will be a core keynote topic?
Val Bercovici
>> I'm going to be reflective a little bit here because if I were to tell you last year that the 60 plus year mature industry of software development would be completely transformed and upended over the next few months, you would not have believed me last year.
Gemma Allen
>> No.
Val Bercovici
>> So I think the most important thing to be open to is completely unimaginable outcomes a year from now. But the safe and easy options are we're just going to see a proliferation of agents, agent orchestration systems. We're going to be seeing a lot more productivity out of our organizations based on these technologies. And I think we're going to be seeing a lot of proof points because we're seeing a lot of promise of these heterogeneous GPU and LPU architectures. We're seeing them promised basically promoted this year. We're going to be seeing a lot of these really impressive proof points on the infrastructure side that are real, and again, just unimaginable outcomes in terms of business value. Pick your next credentialed industry. It's going to be completely upended and transformed this time next year.
Gemma Allen
>> Less noise, more signal. And what about you, Daniel? What are you predicting for the year out?
Daniel Kearney
>> Well, I'll start with the things that I think won't change, which I think efficiency will become even more important, right? As we scale into energy ecosystems that are becoming more challenged, we've got to respect every part that we can in terms of how we convert that into value. The amount of AI natives that are going to be pushing the limits is going to grow as well. We're going to see probably more of those workloads. This morning I was walking along one of the pavements and I was surrounded by a load of little robots and I just thought that was very exciting. And I think they haven't yet had their time in the way they can and they will. And I think once that type of automation and robotics comes into society, we're going to see GDP moving stuff and they're going to require a lot more fit for purpose AI fabric choices. A lot of it will be memory bound, being able to hold decisions and whether they're reasoning or whether they're handling different things. There'll be an agent in that as well. That'll probably be pushed. I'd really be excited to see where the physical AI in combination with that autonomous or agentic code will go as well. So that will be, those are my two things, a bigger focus on efficiency and probably an enrichment of that stack, that physical stack as well, the software, and the ability for more of this type of stuff that we're doing that will come into a new workload space in physical AI.
Gemma Allen
>> The world that's vertically integrated, horizontally aligned, right?
Daniel Kearney
>> Yeah.
Gemma Allen
>> To quote the man himself. But listen guys, thanks so much for coming on theCUBE. Great conversation.
Daniel Kearney
>> Thank you.
Gemma Allen
>> Excited to see what the year ahead looks like for you both, and yeah, keep in touch.
Daniel Kearney
>> Thank you.
Val Bercovici
>> Awesome, thank you.
Gemma Allen
>> I'm Gemma Allen here at the WEKA Pop-Up in San Jose. It's NVIDIA GTC 2026, so much happening here in theCUBE. Stay tuned.
>> Welcome back to theCUBE here on the ground in San Jose, it's NVIDIA GTC 2026, so much happening. I am here at the WEKA Pop-Up where we're talking all things memory and just what is happening in this inference era of AI. Joining me now is Val Bercovici, Chief AI Officer at WEKA; and Daniel Kearney, CTO at Firmus. Welcome, guys.
Val Bercovici
>> Thank you.
Daniel Kearney
>> Thank you very much.
Gemma Allen
>> And happy St. Patrick's Day.
Val Bercovici
>> Yes, absolutely.
Gemma Allen
>> A lot has changed in a year as it relates to memory as a topic, right? A year ago, it was a cost center topic, not something that people have thought a whole lot about. You guys have had an interesting year in terms of your value in the industry. Let's maybe start there. What made this a keynote topic at GTC this year?
Val Bercovici
>> Yeah, and these are really interesting milestones, GTC. GTC last year, we first started talking about the pressure on AI memory in the form of this thing called KV cache. And it was really a very lonely conversation last year because candidly, the predominant use case for this was agents, and March 2025 had no real agent activity in the industry. Fast-forward to May, Claude code started to appear on the radar and gained viral adoption. By December of last year, all of a sudden agents were mainstream. As Jensen was saying, we're now seeing insatiable, uncapped demand for tokens now driven by agents. So today, March 2026 is a fundamentally different 180 degree different conversation where there's huge demand. It's almost a very specific kind of demand, coding agent demand, placing acute pressure on this thing called KV cache, which is very, very memory centric. And what's interesting we'll dive into is the way you can extend memory by a factor of a thousand X by using storage economics and storage cost of goods to deliver that memory capability that the market really demands right now.
Gemma Allen
>> Daniel, level set for us first, Firmus as a company, I know we're going to talk about the partnership and some very interesting developments and results, but tell our listeners and viewers, what exactly does Firmus do?
Daniel Kearney
>> Thanks for having me. Firmus designs, builds, and deploys, and operates AI factories. One of the peculiar things about our company and our mission is really about creating the most efficient AI infrastructure. And our strategy really focuses from model to grid. We look at every single possible touchpoint where energy is consumed right through from the compute itself, the thermal management, the power and the integration with the grid. And of course, part of that is actually our infrastructure, our accelerated compute layers. We really want to make sure every watt counts in this ecosystem, and I think that's what we saw this year at GTC even more than ever. As scale grows and energy demand for these systems is even bigger, we want to make sure that we're driving the best efficiency possible. That's why this discussion is really apt because there's layers of this at every point where we're, can we do better with algorithms? Can we do better with systems? And again, against this moving target, which is the workloads. And now we've just seen this demand for more and more context to stay in the system for longer and longer.
Gemma Allen
>> On the topic of efficiency, you guys just had a very interesting proof of concepts together, your two entities. Talk a little bit about what this showed from the perspective of results and what it means or what it could signal to the market.
Val Bercovici
>> Yeah, so the promise of this technology sounds almost too good to be true. It's arbitrage storage for memory and get all these win-win benefits. We struggle to actually prove this out. We have this, again, natural partnership, efficiency driven partnership. We were able to do courtesy of the graciousness of having access. You need significant resources, significant CPUs and GPUs to do this. We were able to get two racks fundamentally of GPUs from Firmus, prove this out, and the results were what we expected, which was you're able to get out of the same CapEx and OpEx, the same GPUs and energy costs 6.5 times more, so 550% more tokens. So it's as if in a macro scenario, you just created five and a half new data centers out of thin air to serve agents, to serve the tokens for agents.
Gemma Allen
>> We hear a lot about time to token, right? You're talking about usage and I guess leverage per token. What is really driving customer demand? What is the core number one priority that customers are really, really needing?
Daniel Kearney
>> It's pretty multifaceted, at least from our perspective. I think just customers, obviously, we're looking for a high amount of throughput and the lowest latency in terms of tokens per user per second. And we see that graph being used quite a lot and how it's moving to the right of the curve, right? And the more and more throughput we put through models, the more intelligence we get back, but we also want to retain some of the memory and not forget in models. That's where this extended memory comes in. The workload capability, the ability for agents to discuss with other agents and then communicate and hold context is just going to become more and more valuable as a macro trend. That's at least how we're seeing it and we're seeing customers coming with more and more requests for not just GPUs that are super powerful, but it's the full ecosystem. We see AI as a full accelerated compute fabric. There's many different types of systems in there. We see more CPU now today. We see the launch of LPUs, and part of that extended fabric is going to be a lot more ability to hold more prefilled, more memory, and extend that GPU from not just a super computer, but also to a memory based system.
Gemma Allen
>> Because you raised LPUs, I feel it's interesting to ask, okay? We saw some of the Groq 3 announcements by Jensen yesterday. What are your thoughts from the perspective of competitive positioning, and what does that mean for your businesses in the way of, I guess, providing solutions for this LPU era that you are now suddenly also in, right? There are so many things happening at once.
Val Bercovici
>> This is a very fun and very geeky technical topic. I think the macro backdrop here, as Daniel was saying and as Jensen showed on stage yesterday or two days ago, is that consumption went from chat about a year ago, at GTC last year, towards agents now. And so, to your earlier question, chat is very time-to-first-token sensitive, especially for voice agents. It's very awkward when we have that pause. It was even mocked in Super Bowl commercials this year, right? So that's where the LPU technology is really ideal, because I prefer the term fixed latency even more than low latency. These LPUs, Groq, Cerebras, and others, even TPUs to a certain extent, are really, really good at these fixed 100-millisecond-and-below latency interactions, but that's for a certain model size, not the largest models, not agents. And it's for a certain query size, a certain prompt size, typically 8K or below, not the 100K or now million-token prompts and context windows we see. That creates a spectrum of required solutions and integrations. So alongside the LPUs, we still have to have the GPUs, ostensibly for prefill, and we have to have this critical element in the middle, this context memory, so that you're not having this excessive amount of traffic over these expensive high-performance networks between GPUs and LPUs. We can prefill on the GPU, store it at memory speeds on shared context memory storage, and then decode it on LPUs. So it's an elaborate architecture that's forming here, but it's filling the needs of these segments that are emerging as the market matures.
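The flow described here (prefill on a GPU, park the result in shared context memory, decode on an LPU) can be sketched as a toy pipeline. Everything below is a hypothetical illustration: `ContextStore`, `gpu_prefill`, `lpu_decode`, and `serve` are stand-in names, not WEKA, Firmus, or NVIDIA APIs, and the "KV cache" is just a list of placeholder tuples.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical shared context memory: maps a prompt key to its KV cache."""
    blocks: dict = field(default_factory=dict)

    def put(self, key: str, kv: list) -> None:
        self.blocks[key] = kv

    def get(self, key: str):
        return self.blocks.get(key)


def prompt_key(tokens: list) -> str:
    """Stable key for a prompt so any decoder can look up its cached context."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def gpu_prefill(tokens: list) -> list:
    """Stand-in for the compute-heavy prefill phase: one KV entry per prompt token."""
    return [(tok, len(tok)) for tok in tokens]


def lpu_decode(kv: list, max_new_tokens: int) -> list:
    """Stand-in for the decode phase: reads the cache and emits placeholder tokens."""
    return [f"tok{len(kv) + i}" for i in range(max_new_tokens)]


def serve(tokens: list, store: ContextStore, stats: dict) -> list:
    """Prefill only on a cache miss, then decode from the stored context."""
    key = prompt_key(tokens)
    kv = store.get(key)
    if kv is None:
        kv = gpu_prefill(tokens)
        store.put(key, kv)
        stats["prefills"] += 1
    return lpu_decode(kv, max_new_tokens=3)


store, stats = ContextStore(), {"prefills": 0}
prompt = ["summarize", "the", "quarterly", "report"]
serve(prompt, store, stats)
serve(prompt, store, stats)  # second request reuses the stored context
print(stats["prefills"])
```

In this sketch the store sits between the two device types, so the decoder never asks the GPU to recompute a context it has already produced.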
Gemma Allen
>> And in a world where memory suddenly becomes the prominent feature, right? As opposed to just something that happened in the background that you didn't really think a whole lot about, now it's an enabler, it's a competitive advantage. How does that change the buyer journey for both of your companies? Does this become an integrated part of a package for you, for example, Daniel? Does it become something that you suddenly market in a different way? Talk to me a little bit about how the conversation meets the market.
Daniel Kearney
>> Yeah, so when we interact with customers, obviously, from an internal perspective, we're looking at making every watt count. But of course, customers want performance, they want optionality, and they don't know their future in every way, right? No customer could have said in March last year that they knew they'd need that huge amount of memory, and then suddenly the rise of agents came the way it did. Obviously, they can't predict that. And there will probably be more of those moments in the future. So this ability to bring in more specific silicon or more extended context to fit workloads, even retrospectively, and continue to extend the usefulness of a CapEx investment that happened prior, that's huge. That allows a physical flexibility that didn't normally exist in the past. For us, with this type of capability and working with the WEKA team, we can engineer out obsolescence. We can now bring an existing GPU system or GPU-based system to market that's ready for the next generation of workloads without having to redeploy and throw out the old to bring in a whole new system. That's not necessary. We still keep relevant for customers, and they will evolve themselves. Their workloads change. It's not always easy for them to predict how their customers will consume their product, and we can stay relevant and future-proof.
Gemma Allen
>> It's about the experience.
Daniel Kearney
>> Exactly.
Gemma Allen
>> But for you, Val, it's a moment of reckoning, right? It's a moment of reckoning in this industry because you have been talking about something that was a little bit backstage, if not very backstage for a very long time. So how are you thinking about it? It's a great time to be a marketeer at WEKA. How do you think about the message to the market in this moment?
Val Bercovici
>> It's very fulfilling, obviously, to be proven right, but that's not what's valuable to us right now. What's valuable to us right now is offering integration points with great partners like Firmus so that your tenants can deploy this today if they want, for maximum leverage. And the reality is, again, if we go back to Jensen's keynote, there's insatiable demand for tokens driven by agents, including NemoClaw and OpenClaw right now, which means there's an opportunity to capture that token revenue right now as we speak today, and there's an opportunity cost to waiting. And what we're going to find by the time we come back next year and talk is who are going to be the new giants, the new kingmakers that didn't wait, that seized the day, carpe tokenum or whatever, carpe diem, right? They seized the moment, implemented this today as a tenant on Firmus, and were able to just use the leverage of augmented memory to capture 6.5 times the number of revenue users as tenants of Firmus, and not only users but agents. We now measure agent work time in hours and days, and that compound benefit of having faster and faster agent turns spread out over tens of thousands of agent turns means drugs are being discovered faster. Cures are being discovered faster. Trades are being optimized better. There are so many use cases right now where there's massive business value. The simplest case is better PowerPoints faster, right? All sorts of valuable things that, again, are available today, and the winners and losers will be determined by who seizes the moment right now.
Gemma Allen
>> And Daniel, speaking of use cases and the global remit of Firmus and also this opportunity, right? It's a very global opportunity. You're interesting from the position that you guys have very solid footing in APAC, but a global model.
Daniel Kearney
>> Yes.
Gemma Allen
>> Talk a little bit about what sorts of customers you're seeing, what sort of patterns are emerging that perhaps didn't exist a year ago.
Daniel Kearney
>> Yeah. So we serve customers in Singapore, a mixture of hyperscaler and government customers, and some of them are actually foundational labs that are building out national AI models. So that's a very exciting place to be. They're very much at the cusp of what can be captured in Singapore. We're also growing in Australia with our Project Southgate. We're going to deploy up to 2.7 gigawatts of AI capacity over the next two to three years, which is extremely exciting. And we're going to be exposed to a huge amount of change in the workload space with those customers coming on because they're taking huge tenancies and they're going to be quite active in model building as well. We're very much going to be exposed to the world's leading model builders and leading model providers from APAC, and we're going to build out as a main sovereign cloud, but also a global support vehicle for that. We're excited about what's coming and about having optionality in the way we build these clusters, and building clusters that are future-proofed for a world that we are finding hard to predict right now, at least in the AI space, is really good. And remember, we're ruthlessly focused on efficiency, and this is another form of it: obviously there's the energy piece, but also how tokens are used. Carrying tokens and putting them to work where they're needed rather than cycling through unnecessary algorithm loops is also important. These are just other things that agents will be doing, carrying context with them and bringing it into the next compute rather than having to re-compute all the time. So it's one thing to talk about tokenization and the use of tokens, but it's another to use them efficiently and make sure that whatever is produced, whatever watt is converted into intelligence, is being used the way intelligence should be and is cascading forward.
We're always thinking about that from that model to grid thing that I told you about the philosophy of the company, every efficiency point counts because it's on us as well to try to drive that as a company.
Gemma Allen
>> Let's talk for a second about the cost of potential wastage, right? Especially in this inference era we're entering into. There is a belief that you just need as many GPUs as possible, right? But the reality is, if you have any GPU sitting idle, it's a huge cost drain to your business. What are you guys actually seeing from the perspective of customers that you talk to day in, day out? How prominent is this challenge and how much control do they really have over a solution in such a complex time of scaling outward, right?
Val Bercovici
>> Yeah. Today, it's very prominent. This concept of efficiency, again, is why we're natural partners, and sovereign clouds play into this really, really well. If you look at what was emphasized on the keynote stage, again, we're at the inflection point of inference, and inference is conceptually these two phases: the compute-intensive prefill and the memory-intensive decode. Today, on GPU efficiency, you can have a very busy GPU that's prefilling redundantly over and over and over again because it has a very limited memory window. So as soon as some new prompts come in, old prompts have to be evicted, and then you're inefficiently re-prefilling on the GPU side. So does the GPU look busy? Yes, but engineering has a term for this. There's throughput and there's goodput. And goodput is the efficient, useful work that the GPU is doing. So in our case, conceptually, for a long-running agent, instead of prefilling about a million times across 10,000 turns across a day or two, you prefill once. It's called an order-of-one operation instead of order of N. And that's a magical optimization for engineers that this technology enables. And then you spend all of your time and resources and effort on rapidly decoding, which makes your agents more reactive and responsive, shortens those day-long agent runs into hour-long agent runs, and gets us to our results faster. And particularly for sovereign AI, I love the bookends you have right now of Singapore and Australia. In Singapore, energy is a very, very finite thing. Real estate, you could argue, is fine, but energy is super finite. So the ability to generate 550% more tokens from that same energy budget is a game changer for sovereign AI.
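The difference between re-prefilling on every agent turn and prefilling once can be made concrete with a small cost model. This is a sketch under stated assumptions, not a result from the proof of concept: it assumes the worst case where the context is evicted before every turn, it counts prefill work in tokens processed, and the function names and example numbers are hypothetical.

```python
def prefill_tokens_without_cache(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Worst case: the whole accumulated context is re-prefilled on every turn."""
    total = 0
    context = base_context
    for _ in range(turns):
        context += tokens_per_turn  # each turn appends new tokens to the context
        total += context            # and the full context is processed again
    return total


def prefill_tokens_with_cache(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Persisted context: the base is prefilled once, then only new tokens per turn."""
    return base_context + tokens_per_turn * turns


def goodput_fraction(useful_tokens: int, processed_tokens: int) -> float:
    """Share of prefill work that was not a recomputation of earlier context."""
    return useful_tokens / processed_tokens


# Illustrative numbers only: 100K-token starting context, 1K new tokens per turn, 10 turns.
redundant = prefill_tokens_without_cache(100_000, 1_000, 10)
cached = prefill_tokens_with_cache(100_000, 1_000, 10)
print(redundant, cached, goodput_fraction(cached, redundant))
```

In this model the uncached cost grows with the number of turns multiplied by the context size, while the cached cost grows only with the new tokens, which is the gap the "prefill once" framing points at.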
Gemma Allen
>> It's next Monday, we are all home from GTC, we're recovered, and you have some tenants that think, okay, wow, I would like to leverage that opportunity, right? I want those cost advantages. How will this partnership play out in the market?
Daniel Kearney
>> Yeah. I mean, we're going to be giving customers as much choice as possible. And we've had a pre-established relationship with WEKA for a number of years now, hence this kind of POC that we're working on. What I love about WEKA as well is the engineering engagement. They're very much a technical entity just like ourselves, so we're out to solve problems and bring value to customers, and I think that will continue into the future. And like I said, it's about efficient options for customers and making sure they match what the customers actually need, fit for purpose. I think we're always going to continue to push that wherever we can and make sure that if the local entities in Singapore need to push limits, then we have an option for them right there and then, and of course on to Australia as well.
Gemma Allen
>> Well, nerds love nerds, right?
Daniel Kearney
>> Yes.
Gemma Allen
>> Okay.
Daniel Kearney
>> Gravity.
Gemma Allen
>> Okay. Lastly, and I'm going to give you a question first of all. So based on your predictions last year, if we have any Polymarket or Kalshi enthusiasts listening or watching this, what are your predictions for GTC next year? What do you expect will be a core keynote topic?
Val Bercovici
>> I'm going to be reflective a little bit here because if I were to tell you last year that the 60 plus year mature industry of software development would be completely transformed and upended over the next few months, you would not have believed me last year.
Gemma Allen
>> No.
Val Bercovici
>> So I think the most important thing to be open to is completely unimaginable outcomes a year from now. But the safe and easy options are that we're just going to see a proliferation of agents and agent orchestration systems. We're going to be seeing a lot more productivity out of our organizations based on these technologies. And I think we're going to be seeing a lot of proof points, because we're seeing a lot of promise in these heterogeneous GPU and LPU architectures. We're seeing them promised, basically promoted, this year. We're going to be seeing a lot of really impressive proof points on the infrastructure side that are real, and again, just unimaginable outcomes in terms of business value. Pick your next credentialed industry. It's going to be completely upended and transformed this time next year.
Gemma Allen
>> Less noise, more signal. And what about you, Daniel? What are you predicting for the year out?
Daniel Kearney
>> Well, I'll start with the things that I think won't change, which I think efficiency will become even more important, right? As we scale into energy ecosystems that are becoming more challenged, we've got to respect every part that we can in terms of how we convert that into value. The amount of AI natives that are going to be pushing the limits is going to grow as well. We're going to see probably more of those workloads. This morning I was walking along one of the pavements and I was surrounded by a load of little robots and I just thought that was very exciting. And I think they haven't yet had their time in the way they can and they will. And I think once that type of automation and robotics comes into society, we're going to see GDP moving stuff and they're going to require a lot more fit for purpose AI fabric choices. A lot of it will be memory bound, being able to hold decisions and whether they're reasoning or whether they're handling different things. There'll be an agent in that as well. That'll probably be pushed. I'd really be excited to see where the physical AI in combination with that autonomous or agentic code will go as well. So that will be, those are my two things, a bigger focus on efficiency and probably an enrichment of that stack, that physical stack as well, the software, and the ability for more of this type of stuff that we're doing that will come into a new workload space in physical AI.
Gemma Allen
>> The world that's vertically integrated, horizontally aligned, right?
Daniel Kearney
>> Yeah.
Gemma Allen
>> To quote the man himself. But listen guys, thanks so much for coming on theCUBE. Great conversation.
Daniel Kearney
>> Thank you.
Gemma Allen
>> Excited to see what the year ahead looks like for you both, and yeah, keep in touch.
Daniel Kearney
>> Thank you.
Val Bercovici
>> Awesome, thank you.
Gemma Allen
>> I'm Gemma Allen here at the WEKA Pop-Up in San Jose. It's NVIDIA GTC 2026, so much happening here in theCUBE. Stay tuned.