Mitesh Agrawal, chief executive officer at SV Robotics (Positron AI), joins theCUBE’s Dave Vellante during theCUBE + NYSE Wired: Robotics & AI Infrastructure Leaders 2025 event to discuss the future of AI inference in robotic infrastructure. The conversation highlights Positron AI’s unique approach to performance-per-watt efficiency and real-world inference deployment.
Agrawal shares how Positron’s custom silicon design delivers compelling alternatives to dominant players such as Nvidia, especially in edge and token-processing environments.
>> Hi, everybody. Welcome back to theCUBE's Palo Alto Studios. This is theCUBE plus NYSE Wired's Robotics & AI Infrastructure Leaders. My name is Dave Vellante. John Furrier is here. We're going to take a little deviation from the whole robotics conversation and talk a little bit about AI inferencing, which is a super hot topic. Mitesh Agrawal is here, the CEO of Positron AI. Mitesh, thanks for coming on.
Mitesh Agrawal
>> Thank you.
Dave Vellante
>> Years ago, David Floyer and I wrote a post. We did a breaking analysis that said AI inferencing at the edge is going to be the dominant use case once we get through training, and that it's going to be the biggest market, and it seems to be playing out that way.
Mitesh Agrawal
>> Yeah.
Dave Vellante
>> I mean obviously right now, with training, tons of money is pouring in, but we're setting up for that big wave. What's your perspective on all this?
Mitesh Agrawal
>> Yeah, I mean, just confirming that basically. We've been seeing numbers from the big companies, Google saying it's processing 50 trillion tokens per month, which was less than a trillion tokens less than a year ago. So you're seeing a literally exponential rise, and that 50 trillion tokens is just right about now. I just saw a study yesterday that about 14% of United States users use AI today in their daily lives, so that's just 14%, and the second country is India at 8.5% or something.
Dave Vellante
>> That's it, huh?
Mitesh Agrawal
>> That's it.
Dave Vellante
>> Wow. Get on board people.
Mitesh Agrawal
>> All of that usage is inference demand, so just think about how much more scaling there is from all angles. Right. And you think about everything from infrastructure, energy, chips and compute to actual end-user applications, so there's still a lot more to go, and a lot higher to climb.
Dave Vellante
>> What's the premise of Positron AI? Why did you guys, your founders, why'd they start the company?
Mitesh Agrawal
>> Yeah, I think the main reasoning is exactly that, which is that the inference demand is going to keep on increasing and inference is a revenue-generating activity. So people care about the cost structure for it because they want to drive maximum profitability. And so from a technical perspective, you want to get as efficient as possible, so performance per dollar, which is your capital expenditure, and performance per watt of energy, which is your operating expenditure. So you really want to design a chip that is really, really efficient for these inference use cases. Now, NVIDIA and AMD, especially NVIDIA, which is absolutely one of the smartest companies with one of the best GPU architectures out there, they're really good at many things including training, and they're good at inference too. But as applications get larger, as they get more niche, there is so much more room for efficiency in particular applications that you want to drive toward that. And that's where Positron AI comes in. We are basically an ASIC company designing for inference workloads in the beginning, really, really honing in on architectures like the transformer architecture, which is the driving force behind today's ChatGPT and other LLMs out there, like the DeepSeeks, Mistrals and Llamas of the world. So we are really designing a chip that is fundamentally focused on driving that efficiency value, performance per dollar and performance per watt. And we really think that we have a unique design architecture on the chip side as well as a very, very appealing software story, such that we can, for those applications, be highly competitive with the existing offerings out in the market and obviously be better than them. That's how we get market adoption, and that's the focus for the company.
Dave Vellante
>> I want to ask you, because when Jensen speaks, we all listen, and on his earnings calls he'll say X percent of our enterprise business was inference. Presumably that's a lot of ChatGPT, and it's a big number, like 40%. And so you hear that, and a lot of people might think, well, if NVIDIA is dominant in training, why aren't they going to dominate inference? And he'll talk about how, well, you train and you infer, you train and you infer. So help us understand. I mean, I think it does come down to the economics and the efficiency, and then I want to get into what's unique about your ASIC.
Mitesh Agrawal
>> Yeah, to extend that point, absolutely, today when you train, NVIDIA is the go-to, no questions asked, anywhere, and then people continue to use NVIDIA GPUs for inference, rightly so. They're top of the line, the H100 a couple of years ago, now the Blackwell series and the upcoming Rubin, really top-of-the-line performance and things like those. But I think one of the misconceptions in the market is that, oh, if you train on something, you have to continue to infer on that same chip. That's not the case. I mean, today that is often the case, both because NVIDIA is top of the line in terms of performance, but also because they have the CUDA stack, which is an absolutely phenomenal piece of software that everyone uses. So it's the workflow, everyone is accustomed to it, it's all installed there. This is where Positron steps in. We are saying, absolutely, do your training on NVIDIA, but then if you want to run inference, we can drive much better performance per dollar and performance per watt compared to NVIDIA cards with our current first-gen cards, and then obviously with what we're going to be launching in the future. And the software piece is one of the unique things about us. We are basically telling our customers, and they have experienced this, we have real customers that we have deployed with, that they have to make no changes to the workflow they already use with NVIDIA cards when they do inference. Just swap in Positron cards and Positron systems instead of the NVIDIA cards and you'll be able to run similar inference, like a no-code change.
Dave Vellante
>> Really?
Mitesh Agrawal
>> You don't have to compile anything, but we are focused on the transformer architecture. So for those inference applications, we are kind of plug and play. So that's the software USP. And then on the hardware side, it's the unique IP in the chip design itself, how we are processing the data and the memory architecture, that's allowing us to drive that higher efficiency. So those are the two things that we are focused on. And we're not saying otherwise, obviously NVIDIA is going to continue to be the dominant silicon provider of the world, but as applications grow, the market is going to keep on growing, and then you find these applications where you're going to be better than them because you're focused on certain efficiency metrics on either the hardware or the software side, and that's how we drive our market presence and our growth.
Dave Vellante
>> And they're maybe not going to optimize for that use case. So you speak CUDA, is that right? You speak-
Mitesh Agrawal
>> So I want to be very, very specific about it. What happens is we plug right into the NVIDIA ecosystem. What I mean by that is when you train a model, you get a file, a .pt or .safetensors file, that's what you upload to Hugging Face, which is the raw weights of your model. And then when you're running inference, you upload those raw weights onto the NVIDIA GPU, it has its own backend, and it basically gives you an API that you can hit with tokens and you get tokens out. That's what you're using when you use the chatbots. So we do the exact same thing. We take those raw weights, upload them onto our system, and you get an API out that can do the tokens. So from an end user, from an engineering perspective, uploading the raw weights onto an NVIDIA system or a Positron system is exactly the same. The workflow is exactly the same for inference. Now, we are focused only on the Transformers library that is there on Hugging Face. So you can do that for those models, but if you have some other type of model, for example, if you want to run diffusion models today, we don't support them out of the box. So I want to say that yes, we are part of the NVIDIA ecosystem, but it's not because we have somehow tapped into CUDA or anything like that, because if we were tapped into CUDA, then we could support everything under the sun. No, it's because we are taking the output of the NVIDIA training run, which is also what NVIDIA GPUs take to run inference, and we are doing the same exact thing. So we are part of it. What we are not making customers do is, "Hey, you have to compile this model. Every time a new model launches, you have to compile it," or "Even if you do a fine-tune, you have to compile it or change some code." We don't have to do that. Because of that, we are very widely usable within that transformer library.
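For readers who want to see what that "raw weights in, tokens out" workflow looks like in practice, here is a minimal sketch using the open-source Hugging Face Transformers library. The model ID and generation settings are illustrative assumptions; this shows the generic ecosystem workflow being described, not Positron's own serving API.

```python
# Minimal sketch of the "upload raw weights, get a token API" workflow,
# using the open-source Hugging Face Transformers library. The checkpoint
# name below is an illustrative assumption; any causal-LM checkpoint with
# .safetensors weights follows the same pattern.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example checkpoint (assumption)

# Pull the tokenizer and the raw weights (.safetensors) from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Tokens in, tokens out: the interface stays the same regardless of the backend hardware."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate("Explain performance per watt in one sentence."))
```

The point of the sketch is that the serving interface is defined by the raw weights and a tokens-in/tokens-out API, which is why swapping the hardware underneath can, in principle, be a no-code change for the application.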
Dave Vellante
>> This is the workflow that's really plug and play.
Mitesh Agrawal
>> Yeah, exactly.
Dave Vellante
>> Okay. And then tell me more about your ASIC. It's so interesting, the silicon business, how ARM and TSM have just changed that business. I mean, John Furrier used to joke that theCUBE is going to design an ASIC, and VCs, you never wanted to invest in that, now it's-
Mitesh Agrawal
>> From hot to cold to hot.
Dave Vellante
>> Yeah, it's very hot right now. So tell me about your architecture, your memory architecture, your high bandwidth memory, what's it look like?
Mitesh Agrawal
>> Yeah, so when we are thinking about transformer models, they're fundamentally matrix multiplication. And even within that, on the inference side, it's largely matrix-vector math. So you really want very high memory bandwidth. You obviously need a certain level of flops and compute, but you also need very high memory bandwidth because you have to move a lot of bytes of data to use those flops. So because you have to move that data, you need high memory bandwidth. Now, a lot of cards out there, including NVIDIA cards, use HBM memory, so the theoretical memory bandwidth available is very high. For HBM3, it's about three terabytes per second. But when you actually run inference jobs, the realized memory bandwidth is only about one third of that, roughly 30% or so.
Dave Vellante
>> It can't take advantage of the architectural... It can't get close to the architectural limits.
Mitesh Agrawal
>> Yeah. The architecture cannot take advantage of all the actual memory bandwidth available. Whereas for us, the architecture does take advantage of it, and we can drive over 90% of the available theoretical memory bandwidth. So we can really drive high memory bandwidth, which allows us to flow more bytes of data into the compute and, as such, use the compute at a much higher utilization ratio. That basically improves our performance per watt and performance per dollar from a use case perspective, but that's the fundamental, 40,000-foot view of what is really driving the innovation on our side. It's not new physics. We are not doing optical or quantum or anything like that. It's new silicon, a new design for the silicon, but it's basically how the memory management and the fundamental elemental unit of the chip process that compute flow that drives the real performance. And at the first step, unlike a lot of ASIC companies, and this is one of the key things, we are very much product and customer focused, in the sense that we want to get products out in the market as quickly as possible. A lot of ASIC companies have an idea, a lot of times brilliant ideas, and then they take two, three, four years to really come out with their first product in the market. ASIC companies just take time. What we did for our first product was get out in the market in less than 15 months with just a seed amount of money raised. And what we used was an FPGA, a programmable gate array card; we basically designed our architecture on top of the FPGA and showed the world that our architecture actually works and drives really, really high performance. So today, for a lot of the Llama model ecosystem, the open-source model ecosystem, we can drive anywhere from two to 4x performance-per-dollar and performance-per-watt efficiency over an NVIDIA Hopper H100 card. Now obviously we have to stay up with the market. NVIDIA is already on Blackwell, so the ASIC that is coming out from us is going to really drive competition against the Blackwell and Rubin cards of the future. But we are already shipping our first product.
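To put the bandwidth point in perspective, here is a rough back-of-envelope roofline sketch. It assumes a hypothetical 70-billion-parameter model in FP16 and the roughly 3 TB/s HBM3 figure cited above, and simply compares the ~30% realized bandwidth attributed to GPU inference against the ~90% Positron claims; the model size is an illustrative assumption, not a measured result for any card.

```python
# Back-of-envelope roofline estimate for memory-bandwidth-bound decoding.
# During single-stream autoregressive inference, each generated token has to
# stream roughly the full set of model weights past the compute units, so:
#     tokens/sec  ~  effective memory bandwidth / bytes of weights
# Model size and bandwidth figures below are illustrative assumptions.

PARAMS = 70e9             # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2       # FP16/BF16 weights
HBM3_PEAK_BW = 3.0e12     # ~3 TB/s theoretical HBM3 bandwidth, as cited above

weight_bytes = PARAMS * BYTES_PER_PARAM

for label, utilization in [("~30% realized bandwidth", 0.30),
                           ("~90% realized bandwidth", 0.90)]:
    effective_bw = HBM3_PEAK_BW * utilization      # bytes per second actually delivered
    tokens_per_sec = effective_bw / weight_bytes   # upper bound on single-stream decode speed
    print(f"{label}: ~{tokens_per_sec:.1f} tokens/sec upper bound")
```

The exact numbers matter less than the structure of the estimate: when decoding is bandwidth bound, realized bandwidth scales throughput almost linearly, which is why the gap between 30% and 90% utilization translates directly into performance-per-watt and performance-per-dollar claims.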
Dave Vellante
>> Because you built on top of an FPGA, if I'm saying that correctly.
Mitesh Agrawal
>> Correct.
Dave Vellante
>> The programmability of your ASIC, does that have implications for time to market and how long it takes you? Maybe you don't have to do a tape-out for every next generation.
Mitesh Agrawal
>> So when you're building on FPGA that does allow that because FPGA is technically already taped out and you-
Dave Vellante
>> Yes. Right.
Mitesh Agrawal
>> Yeah, right? So that did allow us to get our first product out so quickly, and it allows us to learn from our customer use cases and then take that into the ASIC. But FPGAs are such that if you talk to anyone in the industry, they'll say FPGAs are not going to be a long-term solution to drive efficiency for a specific application, rightly so, because with FPGAs you can put your hardware architecture on them, but they're not going to be the most efficient in terms of the raw specs, like the flops available, the memory available, the type of memory flow. So you have to design an ASIC for it. But at least what we have done is proven out our product, and we are actually deploying it in production environments to show that even with those limitations we can drive this much higher performance. So when we design a specific chip for it, we can drive even more. You do have to do the tape-out process, and that's why ASIC companies take two years, three years, four years to come out with a product, but we really wanted to get a first product out, get real feedback from customers, and then drive that into our ASIC design and ASIC outcome.
Dave Vellante
>> What are the use case patterns that you're seeing evolve?
Mitesh Agrawal
>> Yeah, I think the primary thing today, and it's a little bit biased because we are supporting the transformer architecture, is that you're looking at mostly text-based models. So LLMs, whether it's chatbots or code generation. What we do as a company is target the general-purpose infrastructure underlying it. So inference-as-a-service providers, for example, or certain niche applications that are large enough, whether it's in-game content generation or code generation, especially since we are really good at the generation part of the LLM inference stack. Those are the primary use cases driving it. But if you think about it, chatbots are fundamentally used in so many different verticals, all the way from serious enterprise businesses like healthcare, financial and tech, to consumer use cases like role-play and things like those. Fundamentally, though, the underlying engine is large language models, and that's what we are powering as general-purpose infrastructure. We just announced an offering through Parasail, for example, which is an inference-as-a-service provider, and they have users with different types of use cases. One of the key things for us is that we can host many, many models on the same card. On a GPU, you are generally hosting one model per GPU, and if you want to swap it around, you need to load it and it takes some boot time. Whereas our architecture allows many tens, hundreds of models to be loaded at the same time.
Dave Vellante
>> Okay, so you can target those vertical cases much more rapidly.
Mitesh Agrawal
>> Yeah, exactly. Imagine you have a fine-tune for you, which is your office assistant, and I have a fine-tune for me, which is my office assistant. Instead of dedicating a card to each one of us, the company serving it can put all those models on the same card and then scale that up or down depending on usage. If you're constantly using yours with a lot of traffic, they can dedicate more cards to you, but the model is always active on a single card.
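As an illustration of that hosting model, here is a small, purely hypothetical Python sketch of per-tenant fine-tunes that all stay resident on one accelerator and are selected at request time. ModelHandle and load_finetune are placeholder names invented for this example, not a real Positron or vendor API.

```python
# Hypothetical sketch: many tenant fine-tunes resident on one card, routed per request.
# ModelHandle and load_finetune are invented placeholders, not a real API.
from dataclasses import dataclass

@dataclass
class ModelHandle:
    name: str

    def generate(self, prompt: str) -> str:
        # Placeholder for real inference against this tenant's fine-tuned weights.
        return f"[{self.name}] response to: {prompt}"

def load_finetune(name: str) -> ModelHandle:
    """Pretend loader: on real hardware this would map the fine-tune's weights into card memory."""
    return ModelHandle(name)

# Keep every tenant's fine-tune loaded instead of dedicating a card per tenant.
resident_models = {tenant: load_finetune(f"office-assistant-{tenant}")
                   for tenant in ("dave", "mitesh")}

def handle_request(tenant: str, prompt: str) -> str:
    # Route to the already-resident fine-tune; no per-request model swap or boot time.
    return resident_models[tenant].generate(prompt)

print(handle_request("dave", "Summarize my meetings for today."))
```

The point of the sketch is the routing pattern: every model stays active somewhere, and capacity is scaled per tenant by adding cards for hot models rather than swapping models in and out.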
Dave Vellante
>> Where are you in your go-to-market? Do you have product market fit? Are you scaling your go-to-market?
Mitesh Agrawal
>> So yeah, we are a young company. From an ASIC-timeline perspective, we're two years old, and we launched our product roughly four to five months ago. So right now we are very early in terms of product-market fit, if you want to call it that. I consider product-market fit as having huge amounts of revenue, we're talking about hundreds of millions of revenue, but right now we have early customers that we have deployed with. A few of them have converted to production customers, a few of them are still in proof of concept. So that's where we are currently at. We are really driving that over the next six months into the market with our first-generation product. And in the ASIC world you never sleep, you always have to think about the next product and the one after it. But in terms of product-market fit, what we have shown today is a real performance advantage for our architecture, and that's what we're driving initially with our customers.
Dave Vellante
>> That has implications for your go-to-market. I mean, you don't want to scale your go-to-market until you've really nailed your product-market fit.
Mitesh Agrawal
>> Exactly.
Dave Vellante
>> You're very intimate with that go-to-market. I presume you're consultative selling, right?
Mitesh Agrawal
>> Yeah, it has to be right now, because we learn as much as we give with the product, and the more people that use it, the better. And the second thing is also just around practical scaling: actual supply chain production, the actual ability to have the balance sheet to scale that out, and things like those. That comes from continuously building your sales pipeline and raising.
Dave Vellante
>> And you haven't raised a ton of dough. Is my data right? You'd raised like 23, 24 million.
Mitesh Agrawal
>> Correct. Yeah.
Dave Vellante
>> It's very efficient.
Mitesh Agrawal
>> Yeah. I think one thing we can be a little narcissistic, or proud, about in a way is our capital efficiency. We launched our first product when the company had only raised twelve and a half million. And again, if you look at ASIC companies' backgrounds and history, it generally takes hundreds of millions to get a first product out. So we are very proud of that. Obviously, you can imagine that showing that capital efficiency continues to drive what we are going to do in the future and allows us to raise more. But yeah, we have only publicly announced a raise of $24 million and have really gotten our first product out with real paying customers, and we're going to continue to build toward our next ASIC.
Dave Vellante
>> Well, excellent. Congratulations, really appreciate you coming on theCube.
Mitesh Agrawal
>> No, thank you for bringing-
Dave Vellante
>> And then energy, performance per watt. It's the big blocker, right?
Mitesh Agrawal
>> Yeah. I think the infrastructure world right now is really looking at it very deeply. And I'm sure you're talking to Chase after this.
Dave Vellante
>> Yeah. Exactly. He's living it.
Mitesh Agrawal
>> Exactly. He'll enlighten you, he's one of the best people out there on it. But yeah, first of all, period, we have to grow energy production, there's no two ways about it. But whatever energy we do have, we should drive as much performance out of it as we can, because neither of those alone will be enough for what is coming. That's what it is.
Dave Vellante
>> All right, Mitesh, thank you very much. Appreciate it. Stay right there. Okay. And thank you for watching. We'll be right back after this short break. This is Dave Vellante, for John Furrier and the entire theCUBE team, theCUBE plus NYSE Wired. You're watching our Robotics & AI Infrastructure Leaders coverage. We'll be right back after this short break.