In this conversation at theCUBE + NYSE Wired: AI Factories – Data Centers of the Future, theCUBE’s John Furrier sits down with Vipul Prakash, co-founder and CEO of Together AI, to explore how AI factories are redefining enterprise infrastructure. Prakash details the breakneck growth of AI-native applications – where usage that once took SaaS nine months to double is now happening in nine days – and why this demand is forcing a rethink of compute, storage and networking. He explains how leading apps segment traffic across closed APIs and fine-tuned open-source models, and why efficiency, scale and low latency make AI factories the new unit of value in the enterprise stack. The discussion highlights the critical role of data adjacency (fabric-connected, parallel storage placed next to models), continuous testing for gen AI (from entropy checks to A/B rollouts and ECC/network reliability), and the organizational advantage of appointing a chief AI officer to cut through policy and security inertia. Prakash also shares how Together AI is addressing real-world constraints, most notably power, and reveals progress building new AI factories in Maryland and Memphis, with more on the way.
The interview further dives into AI-scale architecture trends (transformer evolution, sharded attention/experts) and enterprise use cases, from digital-native builders to regulated sectors. Prakash outlines Together AI’s end-to-end model: from energized data centers and GPU-dense infrastructure to developer experience, sovereignty and security, aimed at producing more tokens from the same footprint. He closes with a look at unifying training and inference on shared infrastructure to smooth peak/average loads and improve economics, plus customer momentum ranging from Zoom to VFS (visa processing with 156 governments). It’s a pragmatic blueprint for how AI factories are becoming the cornerstone of digital strategy – and the most profound infrastructure shift since the dawn of the internet.
Sid Sheth, Founder, d-Matrix
Dave Vellante
>> Hi everybody, welcome back to the New York Stock Exchange. This is the NYSE Wired plus theCUBE's AI Factories - Data Center of the Future, and we're super excited to have Sid Sheth here. He's the Founder, President, and CEO of d-Matrix. Sid, thanks for coming on. Good to see you again.
Sid Sheth
>> Yeah. No, thanks for having me, Dave. It's always wonderful to chat with you.
Dave Vellante
>> So the last time we chatted was June. You were in our Palo Alto studio for our AI and robotics leaders programming. Not much has happened since June, huh?
Sid Sheth
>> I can barely keep up, Dave. I can barely keep up.
Dave Vellante
>> It's incredible. Jensen was in the exchange yesterday. Michael Dell was here. Of course, AI factories is a big theme. When we talked last, we talked about the shift from training to inference. Now, everybody is talking about it and you said at the time training is a performance problem. Inference is really all about efficiency, performance per watt, cost. And so I wonder if you could sort of talk about how that's, first of all, come to the frontal lobe of most people's minds here and how that shift is changing the design of data centers and edge. What does it mean for the next generation AI?
Sid Sheth
>> Yeah. Yeah. I think when we spoke in June, this was post the DeepSeek moment. And I think my point of view at the time was we have officially entered the age of inference. And I think all these announcements that you've seen in the last six months are further validation of the fact that the age of inference is truly here and everyone is trying to play catch-up, because so much time and money and resources have been spent, call it over the last 10-plus years, on training larger and larger AI models and making them more capable. And then you have this watershed moment in the early part of this year where you realize that we have reached a point in the capability of these models where you can now take them and make them really efficient. You can distill them down, make them smaller, and they're still as capable, and there are a lot more people who can use this, not just a few people. So I think you're just seeing a cascade of events after that DeepSeek moment in January of 2025 that has catalyzed the age of inference, and now it is all about deploying a lot of compute to serve the needs of humanity. And that is going to require so much compute that people are just trying to put their arms around, "Wow, how much inference compute do I need to run to really serve all of humanity and serve all the applications I want to build?"
By the way, the training still continues. It's not like training comes to an end. People are still wanting to train bigger models and get to AGI, so that quest continues. So at this point, you're looking at these mega, mega announcements, hundreds of billions of dollars of compute being deployed to serve both the training needs of the future and, of course, the exponential inferencing needs of the future.
Dave Vellante
>> We asked Michael Dell, did you see any signs of demand? I mean, are you sleeping with one eye open? Of course, he's seen bubbles burst before, and he said absolutely no end in sight for this demand. It just seems to be insatiable. So people, I'm sure, yourself, you're looking for those signs, and people are talking about the circular reference and taking on a little bit more debt, but it really does feel more like the mid-part of the 1990s, '96, '97, than it does '99, doesn't it?
Sid Sheth
>> Yeah, everyone is trying to put their arms around where we are in the cycle. And I think, again, I would warn against comparing to the past, right? Because of the way the internet bubble rolled out, there was a certain unique set of events that happened leading to what you call the dot-com crash in late '99, early 2000. It's not like we are seeing similar patterns of behavior. Obviously, the promise of the opportunity is just so large that everyone wants to get in on it, but I would caution against trying to draw too many parallels to what has happened in the past, because that was a different point in time. Capital needs were very different. Availability of capital was very different. Towards the end of the decade, we were in a rising interest rate environment, not a falling interest rate environment. So yeah, you could be right. I mean, it certainly doesn't seem like we are close to the end here. Inference is certainly just getting started. We are in the very early innings of inference compute, the deployment of AI, the part of AI where people are going to find a way to recoup their massive investments and actually build applications and introduce productivity. So I think that is still very, very early, so it does feel like we are not anywhere close to the end right now. And I would agree. I don't see an end in sight yet, and there's lots more to do, actually. Lots more to do as an industry, given the amount of work that needs to be done to get all of this rolled out, so I think there's a lot more to go.
Dave Vellante
>> I think those are good cautions. The past is not prologue and you've got to be there. You've got to be there investing. This is a wave that you want to ride. So do you think that the AI infrastructure is going to bifurcate? I mean, we've talked about this, that only a handful of companies are going to be able to go after the million GPU clusters, the mega clusters, and chase AGI, but most of the market's going to be focused on your space, cost-efficient, deployable models. DeepSeek, as you said, was a real watershed moment. How do you think that bifurcation, if you agree with that premise, is going to shape the data center landscape, the edge landscape? Are we going to see sort of training factories and inference factories? Do those two worlds come together in some way? What are your thoughts on that?
Sid Sheth
>> Yeah. That's a great question. You're going to see a lot more compute. I think that's the underlying theme here, because our thesis at d-Matrix was always that inferencing compute is going to be 100 times more, 10 to 100 times more than training compute. So whatever you've seen being built so far has primarily been done for training compute. And what you're beginning to see now in 2025 is these mega announcements, a lot of them around inferencing compute. So people are beginning to look forward and say, "Wow, with the availability of these reasoning models, with the capabilities of these models just improving exponentially, and with a lot more agentic capabilities becoming more readily available, the amount of compute we're going to need is just 10 to 100 times more than what we did on training. So my god, what does an infrastructure build for that opportunity really look like?" Do these two coexist? I think the two coexist. I think, again, the dynamic has not changed. There are four or five companies in the world who can really afford to go on the AGI journey, and that is where you need 100,000 GPU clusters going to 500,000 GPU clusters going to, call it, a million GPU clusters sometime in the future. This OpenAI announcement with NVIDIA and AMD is about getting to that point where you are deploying 100,000 to 500,000 GPU clusters so that you can train the next big model, unleash a new set of capabilities. Because every time you ... The hope is that every time you launch a new frontier model, you launch a new set of capabilities, new set of intelligence, new levels of reasoning, and that will happen with GPT-6 whenever that comes out. And that will then unleash a whole new set of capabilities, and smaller models will come out of that, and a lot more inferencing will happen because of that. So I think the two will coexist.
You'll have very large factories for training, which will be owned by a few folks because it's so capital intensive, but there'll be many, many, many more inferencing factories. And inferencing is going to be done at many, many different points in the network. There's going to be mega factories, there's going to be edge factories, there's going to be small computing nodes, large computing nodes. The hyperscalers, the new clouds, the sovereigns, everybody needs inferencing compute. So I think that is just going to look a lot more spread out and a lot more distributed. The training factories are a lot more consolidated relatively speaking compared to inference.
Dave Vellante
>> Yeah, the deals that are going down with OpenAI are just mind-boggling. First of all, the NVIDIA investment. I mean, Jensen has said, "I wish I could have invested earlier." And there's no strings attached, is my understanding. Then OpenAI turns around and says, "Okay, we're going to now work with AMD." AMD basically gave warrants for a big chunk of its company, actually. And so I feel like the NVIDIA deal was maybe a little bit better and they're in a stronger position, but still. And then you see Oracle. Oracle got hit the other day because there was a report in The Information saying that their gross margins are like 14% on this stuff. And I'm like, "Yeah, well, if you're basically reselling GPUs, that's what's going to happen." That's what happened with Intel. That's clearly what's happening with NVIDIA. Look at gross margins of a company like Dell. So I want to talk about your platform, because you're really architecting this platform for efficiency, not just endless excess compute. I think Corsair is the name of your platform, correct?
Sid Sheth
>> That's right.
Dave Vellante
>> So you built that purpose built really for inference. So you've got in-memory compute. I think we talked about this, your chiplet architecture, you've got floating point numerics. How do these design choices affect performance per watt? Is that the key metric and how is that affected? And what does that mean for the total cost of these AI factories that are being deployed?
Sid Sheth
>> Yeah, so let's take a step back and peel that onion a little bit. So what are the key metrics that customers care about? Inference is becoming very, very pervasive, and inference again is all about return on investment. It's about being efficient, it's about making sure that you have applications that can generate revenue and provide users with a great user experience. And it's all about finding ways to unleash productivity, and efficiency matters when you're trying to do all of those things. So what are the key metrics? So one is perf per dollar, performance per dollar. That is the cost portion. How much performance can I get for a given dollar that I invest? Then there is performance per watt. How much performance can I get for a given watt of energy that I have already invested in? And all this in the context of latency, or speed. So, yes, it's good to have performance per dollar. It's good to have performance per watt, but what is also important is that I provide extremely fast inference to my users, because it is becoming very clear that users want to stay with an application only if they can interact with that application. If it is a truly offline experience where I talk to an AI application and then I have to wait a few minutes or even tens of minutes to get a response back, most users don't want to hang around that application, so speed is going to matter a lot. Can I interact with this application? Can I do things in real time? How quickly can I do this? And that is just humans interacting with machines, right? Wait till we get to the part where agents come in and machines interact with machines, and then the latency piece becomes even more important, because you don't want your machines sitting around waiting for a response. You want this all to be real time. So latency is going to be very, very important in the context of, of course, the other two metrics also, perf per dollar and perf per watt.
So these are really the key metrics that every customer I speak with is talking about. That's how they want to measure the efficiency of the platform. And coming back to what we have done at d-Matrix, it is essentially to take those key metrics and ask: what does one need to do in the hardware? What does one need to do in the software? What does one need to do in how the platform gets deployed at the customer? How do you excel on these three key metrics, and what kind of underlying architectural innovations does one have to undertake to get to that point? And that is what we have done at d-Matrix. If you look at perf per dollar, cost, everything that we have done in the Corsair platform has been done around cost-effectiveness. We don't want to buy the most expensive memories, we don't want to use the most expensive packaging technologies, and we don't want to use the most expensive processing technologies, and we still retain a lot of the benefits. On perf per watt, the in-memory computing that you touched upon has all been built to excel at energy efficiency. And the same in-memory computing architecture also allows us to excel a lot on speed and latency, because you are essentially keeping compute and memory together and you're keeping the model parameters right in memory where the compute is happening. So you essentially have to make a lot fewer trips to memory, and that saves you a lot of time. So that's where we gain in terms of latency. A lot of these decisions that we made at d-Matrix on the Corsair platform carry into the future. That same underlying foundation will carry on into our future products, where it's all about compute and memory integration to attack the memory bottleneck, and then it is about the compute and IO bottleneck. So we want to attack that bottleneck as our next problem. We announced a product called JetStream just a couple of weeks ago, which attacks that.
So we are essentially going to the core of these bottlenecks and keeping inference really, really efficient.
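[Editor's sketch] The three metrics described above (performance per dollar, performance per watt, and latency) can be written down as simple ratios. The Python below is only an illustration under stated assumptions: the `InferencePlatform` fields, the function names, and every number are hypothetical and are not d-Matrix specifications or figures from the interview. The last function is a common back-of-the-envelope bound, not something stated by the speaker: if each generated token requires reading every model parameter from memory once, the time per token cannot be lower than model size divided by memory bandwidth, which is one way to see why fewer trips to memory help latency.

```python
from dataclasses import dataclass


@dataclass
class InferencePlatform:
    """Hypothetical platform description; every field is illustrative."""
    name: str
    tokens_per_second: float     # sustained throughput across all users
    cost_dollars: float          # money invested in the hardware
    power_watts: float           # power drawn under load
    latency_ms_per_token: float  # wait experienced by a single user per token


def perf_per_dollar(p: InferencePlatform) -> float:
    """Throughput obtained for each dollar invested."""
    return p.tokens_per_second / p.cost_dollars


def perf_per_watt(p: InferencePlatform) -> float:
    """Throughput obtained for each watt of power drawn."""
    return p.tokens_per_second / p.power_watts


def meets_latency_budget(p: InferencePlatform, budget_ms: float) -> bool:
    """True if per-token latency is within the interactive budget."""
    return p.latency_ms_per_token <= budget_ms


def min_ms_per_token(model_bytes: float, memory_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token time, assuming all weights are read once per token."""
    return 1000.0 * model_bytes / memory_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    card = InferencePlatform("example-card", 1000.0, 500.0, 250.0, 20.0)
    print(perf_per_dollar(card), perf_per_watt(card))
    print(meets_latency_budget(card, budget_ms=50.0))
    # A 16 GB model read over a 1.6 TB/s memory link.
    print(min_ms_per_token(16e9, 1.6e12))
```

Comparing two platforms on the first two ratios only makes sense at a fixed latency budget, which is why the latency check is kept separate from the efficiency ratios.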
Dave Vellante
>> So inference becomes this sort of new substrate, this new logic layer if you will. You've got models that are now reasoning, they're thinking. So I think you've used the term decision point, that inference is the decision point inside of data centers where ... And this is where all the agentic reasoning is going to happen. So how should people, enterprises in particular, think about architecting their data centers and AI factories at scale, if that is a valid premise?
Sid Sheth
>> Yeah, yeah. You're right. So inference is all about decisions. I mean, you have a lot of intelligence out there already, so you have all these trained models, there is a lot of intelligence, but what do you do with all that intelligence? Either you use that intelligence to generate content, but again, that is a decision process. What kind of content do you want to generate? Or you use that intelligence to essentially look at the information the organization has and then you want to make a set of decisions. A lot of this is just about making a lot of decisions. And can you offload decision making from the humans to the machines? And what kind of decisions do you want to offload from humans to the machines? So certain critical strategic decisions can still be made by humans, but a lot of the tactical decisions, humans spend a lot of time making a lot of tactical decisions on a daily basis. Those can be offloaded to the machines. Given a set of boundary conditions, given a set of rules, these machines can go off and make decisions within those confines, and you don't really need humans to make all those decisions. So again, that is going to unleash a lot of productivity, because now you don't need humans to collect all that data, sift through all that data, then have to make decisions with all of that data. Sometimes, decisions can get political and you have to build consensus, and that can take a lot of time. Imagine the amount of productivity that can be unleashed if machines just could do all of that for you. So enterprises will have to essentially find ways to introduce machines or agents into their enterprise workflows so that they can essentially offload that decision-making process. I don't think that has happened yet, by the way, Dave. 
I think we are still in the early innings of introducing agents, specifically AI, in the form of agents into enterprise workflows where they can go in there and grab all of that data and make intelligent decisions off of that data and introduce all that productivity that we just talked about.
Dave Vellante
>> Sid, when you think about these big data center build-outs from CoreWeaves and the neoclouds, even the cloud players, but specifically xAI and companies like that, they're green fields and so they don't have to rip and replace. You and I have talked about the need to avoid ripping and replacing, and I think that you can augment existing infrastructure, whether it's servers from Dell, HPE, Lenovo, Supermicro, whomever. Can you validate that design philosophy for accelerating adoption? We've seen the x86, the CUDA deal that went down with NVIDIA and Intel, which we saw as a bridge for enterprises and this hybrid architecture, and it sounds like you're taking a similar philosophy. How is that going, and is that sort of the right trajectory in your opinion?
Sid Sheth
>> Yeah, so that's the fundamental crux of our go-to-market at d-Matrix. We are building solutions that can be used to augment our customers' fleets of compute. It's not like we build an AI server at d-Matrix or we build racks of our own at d-Matrix. So we don't go into our customer and say, "Look, you have to buy the whole solution from us." In fact, we go into our customers and say, "Look, you tell us what works for you. Who is your partner? Who is your chosen partner? Or here is a partner that we are happy to work with if you like them." So we give the customer the ability to decide which partner they want to work with. We work with multiple partners. Our solutions at the end of the day are accelerator cards. These could be compute accelerators for inference. I talked about the JetStream card, which is an IO accelerator, which allows us to kind of scale out more compute. Now, you can take those cards and you can plug them into any server. We talked about Dell, HP, Lenovo, it could be anybody. We work with Supermicro today. And then we can plug those servers into a rack that a customer has. We don't dictate what kind of racks they should build. So it is truly a collaborative effort in terms of how we take our solution to market. And the whole premise here is that we will go into our customer and augment our customer's fleets with our solutions. So they have an existing fleet today. They can augment that fleet for generative AI inference with our solutions, and they can do very efficient AI inferencing using the accelerator cards that we provide to them. The reason we think that is a much better model for AI inferencing is because, again, inference is not done at any one point in the network. As we stated earlier, inference will be done in big data centers, small data centers, sovereigns, neoclouds. It's a collection of different people that want to use this.
Ecosystems are very, very different across all those different customers. So you can't really find a one-size-fits-all solution for inference that you can take to every customer and serve everybody's needs with that single solution. So that is the reason we have created this highly collaborative way of going to market, so that we give our customers a broad set of choices and they can then decide how they want to use our product in their environment.
Dave Vellante
>> So that gives customers optionality. That gives you go-to-market flexibility. But if I understand it, you also do have a vertical integration play. You can stack memory, you can stack compute, you can stack IO. I think you've talked about building a skyscraper floor by floor. If I want to go that route, I can. What are the advantages? I mean, I get the advantages of optionality. You can pick and choose, but are there clear monetizable benefits for the customer of going that vertical integration route?
Sid Sheth
>> Yeah. So Dave, I think that's a very good question. Just one clarification. When we talk about the vertical integration, we are talking about vertically integrating silicon. So memory and compute, right? We talked about memory and compute integration, and the vertical solution that we have created is around putting more memory on top of compute and building that skyscraper so that we can always increase the capacity of memory that we add to that compute. The optionality that I was referring to is really about once you already have these cards. So up to the card level, we build all the silicon, we do the vertical integration, we build the software, we do all of that. That's d-Matrix. There is a lot of innovation that goes into that card; that's where all the in-memory compute and the 3D stacking and all that stuff happens. But once we take that card to market, that card can plug into any server from multiple different vendors. That's where the optionality starts for the customer. So the customer is still telling us, "Hey Sid, we love what d-Matrix is doing. You have a lot of fantastic innovation on your accelerator cards. We want all of that. But then once we take that card, give us the optionality to plug that card into any server of our liking, because we have our chosen server partners or we have our chosen server rack partners, or this is the way we connect up all the servers in our data centers. So that's the part where we would like to have the flexibility to decide how we put all of that together, but you put that card together."
Dave Vellante
>> Got it. Makes total sense. Sid, thank you so much. I mean, look, training is building these big mega data centers, and that's been powering the AI wave for the past several years. Like you said, training is not going to go away. Inference, talking about stacking, we're going to stack that on top, and reasoning, and it's just going to grow the TAM. I really appreciate your time, Sid, and it's always a pleasure having you on. Hope we can have you face to face here at NYSE.
Sid Sheth
>> I would love to do that, Dave. I'd love to do that. So look forward to that and wonderful chatting with you.
Dave Vellante
>> Great, thank you. Okay, and thank you for watching. This is Dave Vellante for the NYSE Wired and theCUBE's AI Factories - Data Center of the Future series. We'll be right back right after this short break.
>> Hi everybody, welcome back to the New York Stock Exchange. This is the NYSE Wired plus theCUBE's AI Factories - Data Center of the Future, and we're super excited to have Sid Sheth here. He's the Founder, President, and CEO of d-Matrix. Sid, thanks for coming on. Good to see you again.
Sid Sheth
>> Yeah. No, thanks for having me, Dave. It's always wonderful to chat with you.
Dave Vellante
>> So last time, we chatted was June, you were in our Palo Alto studio for our AI and robotics leaders programming. Not much has happened since June, huh?
Sid Sheth
>> I can barely keep up, Dave. I can barely keep up.
Dave Vellante
>> It's incredible. Jensen was in the exchange yesterday. Michael Dell was here. Of course, AI factories is a big theme. When we talked last, we talked about the shift from training to inference. Now, everybody is talking about it and you said at the time training is a performance problem. Inference is really all about efficiency, performance per watt, cost. And so I wonder if you could sort of talk about how that's, first of all, come to the frontal lobe of most people's minds here and how that shift is changing the design of data centers and edge. What does it mean for the next generation AI?
Sid Sheth
>> Yeah. Yeah. I think when we spoke in June, this was post the DeepSeek moment. And I think my point of view at the time was we have officially entered the age of inference. And I think all these announcements that you've seen in the last six months are further validation of the fact that the age of inference is truly here and everyone is trying to play catch-up because so much time and money and resources has been spent, call it over the last 10 plus years on training larger and larger AI models and making them more capable. And then you have this watershed moment in the early part of this year where you realize that we have reached a point in the capability of these models where you can now take them and make them really efficient. You can distill them down, make them smaller, and they're still as capable and there is a lot more people who can use this, not just a few people. So I think you're just seeing a cascade of events after that DeepSeek movement in January of 2025 that has catalyzed the age of inference and now it is all about deploying a lot of compute to serve the needs of humanity. And that is going to require so much compute that people are just trying to put their arms around, "Wow, how much inference in compute do I need to run to really serve all of humanity and serve all the applications I want to build?"
By the way, the training still continues. It's not like training comes to an end. People are still wanting to train bigger models and get to AGI, so that quest continues. So at this point, you're looking at these mega, mega announcements, hundreds of billions of dollars of compute being deployed to serve both the training needs of the future and, of course, the exponential inferencing needs of the future.
Dave Vellante
>> We asked Michael Dell, did you see any signs of demand? I mean, are you sleeping with one eye open? Of course, he's seen bubbles burst before and he said absolutely no end in sight for this demand. It just seems to be insatiable. So people, I'm sure, yourself, you're looking for those signs, but people talking about the circular reference and taking on a little bit more debt, but it really does feel like the mid-part of the 1990s, '96, '97 than it does '99, doesn't it?
Sid Sheth
>> Yeah, everyone is trying to put their arms around where we are in the cycle. And I think, again, I would warn against comparing to the past, right? Because typically, with the way the internet bubble rolled out, there was a certain unique set of events that happened leading to what you call the dot-com crash in late '99, early 2000. It's not like we are seeing similar patterns of behavior, because obviously, the promise of the opportunity is just so large, everyone wants to get in on it, but I would caution against trying to draw too many parallels to what has happened in the past, because that was a different point in time. Capital needs were very different. Availability of capital was very different. Towards the end of the decade, we were in a rising interest rate environment, not a falling interest rate environment. So yeah, you could be right. I mean, it certainly doesn't seem like we are close to the end here. Inference is certainly just getting started. We are in the very early innings of inference compute, the deployment of AI, the part of AI where people are going to find a way to recoup their massive investments and actually build applications and introduce productivity. So I think that is still very, very early, so it does feel like we are not anywhere close to the end right now. And to me, I would agree. I don't see an end in sight yet, and there's lots more to do actually. Lots more to do as an industry, given the amount of work that needs to be done to get all of this rolled out, so I think there's a lot more to go.
Dave Vellante
>> I think those are good cautions. The past is not prologue and you've got to be there. You've got to be there investing. This is a wave that you want to ride. So do you think that the AI infrastructure is going to bifurcate? I mean, we've talked about this, that only a handful of companies are going to be able to go after the million GPU clusters, the mega clusters, and chase AGI, but most of the market's going to be focused on your space: cost-efficient, deployable models. DeepSeek, as you said, was a real watershed moment. How do you think that bifurcation, if you agree with that premise, is going to shape the data center landscape, the edge landscape? Are we going to see sort of training factories and inference factories? Do those two worlds come together in some way? What are your thoughts on that?
Sid Sheth
>> Yeah. That's a great question. You're going to see a lot more compute. I think that's the underlying theme here, because our thesis at d-Matrix was always that inferencing compute is going to be 100 times more, 10 to 100 times more than training compute. So whatever you've seen being built so far has primarily been done for training compute. And what you're beginning to see now in 2025 is these mega announcements, a lot of them are around inferencing compute. So people are beginning to look forward and say, "Wow, with the availability of these reasoning models, with the capabilities of these models just improving exponentially, and with a lot more agentic capabilities becoming more readily available, the amount of compute we're going to need is just 10 to 100 times more than what we did on training. So my god, what does an infrastructure build for that opportunity really look like?" Do these two coexist? I think the two coexist. I think, again, the dynamic has not changed. There's four or five companies in the world who can really afford to go on the AGI journey, and that is where you need 100,000 GPU clusters going to 500,000 GPU clusters going to, call it, a million GPU clusters sometime in the future. This OpenAI announcement with NVIDIA and AMD is about getting to that point where you are deploying 100,000 to 500,000 GPU clusters so that you can train the next big model, unleash a new set of capabilities. Because every time you ... The hope is that every time you launch a new frontier model, you launch a new set of capabilities, new set of intelligence, new levels of reasoning, and that will happen with GPT-6 whenever that comes out. And that will then unleash a whole new set of capabilities and smaller models will come out of that and a lot more inferencing will happen because of that. So I think the two will coexist.
You'll have very large factories for training, which will be owned by a few folks because it's so capital intensive, but there'll be many, many, many more inferencing factories. And inferencing is going to be done at many, many different points in the network. There's going to be mega factories, there's going to be edge factories, there's going to be small computing nodes, large computing nodes. The hyperscalers, the new clouds, the sovereigns, everybody needs inferencing compute. So I think that is just going to look a lot more spread out and a lot more distributed. The training factories are a lot more consolidated relatively speaking compared to inference.
Dave Vellante
>> Yeah, the deals that are going down with OpenAI are just mind-boggling. First of all, the NVIDIA investment. I mean, Jensen has said, "I wish I could have invested earlier." And then AMD turns around and there's no strings attached, is my understanding. OpenAI turns around and says, "Okay, we're going to now work with AMD." AMD basically gave warrants for a big chunk of its company, actually. And so I feel like the NVIDIA deal was maybe a little bit better and they're in a stronger position, but still. And then you see Oracle. Oracle got hit the other day because there was a report in The Information saying that their gross margins are like 14% on this stuff. And I'm like, "Yeah, well, if you're basically reselling GPUs, that's what's going to happen." That's what happened with Intel. That's clearly what's happening with NVIDIA. Look at gross margins of a company like Dell. So I want to talk about your platform because you're really architecting this platform for efficiency, not just endless excess compute. I think Corsair is the name of your platform, correct?
Sid Sheth
>> That's right.
Dave Vellante
>> So you built that purpose-built, really, for inference. So you've got in-memory compute. I think we talked about this: your chiplet architecture, you've got floating point numerics. How do these design choices affect performance per watt? Is that the key metric, and how is that affected? And what does that mean for the total cost of these AI factories that are being deployed?
Sid Sheth
>> Yeah, so let's take a step back and peel that onion a little bit. So what are the key metrics that customers care about? Inference is becoming very, very pervasive, and inference again is all about return on investment. It's about being efficient, it's about making sure that you have applications that can generate revenue and provide users with a great user experience. And it's all about finding ways to unleash productivity, and efficiency matters when you're trying to do all of those things. So what are the key metrics? So one is perf per dollar, performance per dollar. So that is the cost portion. How much performance can I get for a given dollar that I invest? Then there is performance per watt. How much performance can I get for a given watt of energy that I have already invested in? And all this in the context of latency or speed. So, yes, it's good to have performance per dollar. It's good to have performance per watt, but what is also important is that I provide extremely fast inference to my users, because it is becoming very clear that users want to stay with an application only if they can interact with that application. If it is a truly offline experience where I talk to an AI application and then I have to wait a few minutes or multiple minutes or even tens of minutes to get a response back, most users don't want to hang around that application, so speed is going to matter a lot. Can I interact with this application? Can I do things in real time? How quickly can I do this? And that is just humans interacting with machines, right? Wait till we get to the part where agents come in and machines interact with machines, and then the latency piece becomes even more important because you don't want your machines sitting around waiting for a response. You want this all to be real time. So latency is going to be very, very important in the context of, of course, the other two metrics also, perf per dollar, perf per watt.
So these are really the key metrics that every customer I speak with is talking about. That's how they want to measure the efficiency of the platform. And coming back to what we have done at d-Matrix, it is essentially to take those key metrics and ask: what does one need to do in the hardware? What does one need to do in the software? What does one need to do in how the platform gets deployed at the customer? How do you excel on these three key metrics and what kind of underlying architectural innovations does one have to undertake to get to that point? And that is what we have done at d-Matrix. If you look at perf per dollar, the cost, everything that we have done in the Corsair platform has been done around cost-effectiveness. We don't want to buy the most expensive memories, and we don't want to use the most expensive packaging technologies, and we don't want to use the most expensive processing technologies, and still retain a lot of the benefits. On perf per watt, the in-memory computing that you touched upon has all been built to excel at energy efficiency. And the same in-memory computing architecture also allows us to excel a lot on speed and latency because you are essentially keeping compute and memory together and you're keeping the model parameters right in memory where the compute is happening. So you essentially have to make a lot fewer trips to memory and that saves you a lot of time. So that's where we gain in terms of latency. A lot of these decisions that we made at d-Matrix on the Corsair platform, going into the future, we have carried those same pieces, and that same underlying foundation will carry on into our future products, where it's all about compute and memory integration to attack the memory bottleneck, and then it is about the compute and IO bottleneck. So we want to attack that bottleneck as our next problem. We announced a product called JetStream just a couple of weeks ago, which attacks that.
So we are essentially going to the core of these bottlenecks and keeping inference really, really efficient.
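The three metrics described in this answer (perf per dollar, perf per watt, and latency) can be put into a rough comparison. The sketch below is only an illustration of that framing, not d-Matrix's methodology: the class name, the function name, and every number in it are hypothetical, and it assumes throughput is measured in tokens per second and latency in milliseconds per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorProfile:
    """Hypothetical measurements for one inference accelerator card."""
    name: str
    tokens_per_second: float  # sustained throughput on a chosen model
    price_usd: float          # purchase cost of the card
    power_watts: float        # power draw under that load
    ms_per_token: float       # latency seen by one interactive user

    @property
    def perf_per_dollar(self) -> float:
        return self.tokens_per_second / self.price_usd

    @property
    def perf_per_watt(self) -> float:
        return self.tokens_per_second / self.power_watts


def rank_within_latency_budget(profiles, budget_ms):
    """Drop cards that miss the latency budget, then sort by perf per watt."""
    eligible = [p for p in profiles if p.ms_per_token <= budget_ms]
    return sorted(eligible, key=lambda p: p.perf_per_watt, reverse=True)


# Made-up numbers for illustration only; not vendor data.
cards = [
    AcceleratorProfile("card_a", 4000.0, 20000.0, 800.0, 25.0),
    AcceleratorProfile("card_b", 3000.0, 10000.0, 500.0, 8.0),
    AcceleratorProfile("card_c", 1000.0, 4000.0, 250.0, 5.0),
]

for p in rank_within_latency_budget(cards, budget_ms=10.0):
    print(p.name, p.perf_per_dollar, p.perf_per_watt, p.ms_per_token)
```

Treating latency as a hard filter and the efficiency ratios as the ranking key mirrors the point that perf per dollar and perf per watt only matter "in the context of latency or speed"; a real evaluation would likely weigh all three together for a specific model and workload.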
Dave Vellante
>> So inference becomes this sort of new substrate, this new logic layer if you will. You've got models that are now reasoning, they're thinking. So I think you've used the term decision point, that inference is the decision point inside of data centers where ... And this is where all the agentic reasoning is going to happen. So how should people, enterprises in particular, think about architecting their data centers and AI factories at scale, if that is a valid premise?
Sid Sheth
>> Yeah, yeah. You're right. So inference is all about decisions. I mean, you have a lot of intelligence out there already, so you have all these trained models, there is a lot of intelligence, but what do you do with all that intelligence? Either you use that intelligence to generate content, but again, that is a decision process. What kind of content do you want to generate? Or you use that intelligence to essentially look at the information the organization has and then you want to make a set of decisions. A lot of this is just about making a lot of decisions. And can you offload decision making from the humans to the machines? And what kind of decisions do you want to offload from humans to the machines? So certain critical strategic decisions can still be made by humans, but a lot of the tactical decisions, humans spend a lot of time making a lot of tactical decisions on a daily basis. Those can be offloaded to the machines. Given a set of boundary conditions, given a set of rules, these machines can go off and make decisions within those confines, and you don't really need humans to make all those decisions. So again, that is going to unleash a lot of productivity, because now you don't need humans to collect all that data, sift through all that data, then have to make decisions with all of that data. Sometimes, decisions can get political and you have to build consensus, and that can take a lot of time. Imagine the amount of productivity that can be unleashed if machines just could do all of that for you. So enterprises will have to essentially find ways to introduce machines or agents into their enterprise workflows so that they can essentially offload that decision-making process. I don't think that has happened yet, by the way, Dave. 
I think we are still in the early innings of introducing agents, specifically AI, in the form of agents into enterprise workflows where they can go in there and grab all of that data and make intelligent decisions off of that data and introduce all that productivity that we just talked about.
Dave Vellante
>> Sid, when you think about these big data center build-outs from CoreWeaves and the neoclouds, even the cloud players, but specifically xAI and companies like that, they're green fields and so they don't have to rip and replace. You and I have talked about the need to avoid ripping and replacing, and I think that you can augment existing infrastructure, whether it's servers from Dell, HPE, Lenovo, Supermicro, whomever. Can you validate that design philosophy for accelerating adoption? We've seen the x86, the CUDA deal that went down with NVIDIA and Intel, which we saw as a bridge for enterprises and this hybrid architecture, and it sounds like you're taking a similar philosophy. How is that going, and is that sort of the right trajectory in your opinion?
Sid Sheth
>> Yeah, so that's the fundamental crux of our go-to-market at d-Matrix. We are building solutions that can be used to augment our customers' fleets of compute. It's not like we build an AI server at d-Matrix or we build racks of our own at d-Matrix. So we don't go into our customer and say, "Look, you have to buy the whole solution from us." In fact, we go into our customers and say, "Look, you tell us what works for you. Who is your partner? Who is your chosen partner? Or here is a partner that we are happy to work with if you like them." So we give the customer the decision-making ability to decide who and which partner they want to work with. We work with multiple partners. Our solutions at the end of the day are accelerator cards. These could be compute accelerators for inference. I talked about the JetStream card, which is an IO accelerator, which allows us to kind of scale out more compute. Now, you can take those cards and you can plug them into any server. We talked about Dell, HPE, Lenovo, it could be anybody. We work with Supermicro today. And then we can plug those servers into a rack that a customer has. We don't dictate what kind of racks they should build. So it is truly a collaborative effort in terms of how we take our solution to market. And the whole premise here is that we will go into our customer and augment our customer's fleets with our solutions. So they have an existing fleet today. They can augment that fleet for generative AI inference with our solutions, and they can do very efficient AI inferencing using our accelerator cards that we can provide to them. The reason we think that is a much better model for AI inferencing is because, again, inference is not done at any one point in the network. As we stated earlier, inference will be done in big data centers, small data centers, sovereigns, neoclouds. It's a collection of different people that want to use this.
Ecosystems are very, very different across all those different customers. So you can't really find a one-size-fits-all solution for inference that you can take to every customer and serve everybody's needs with that single solution. So that is the reason we have created this highly collaborative way of going to market, so that we give our customers a broad set of choices and they can then decide how they want to use our product in their environment.
Dave Vellante
>> So that gives customers optionality. That gives you go-to-market flexibility. But if I understand it, you also do have a vertical integration play. You can stack memory, you can stack compute, you can stack IO. I think you've talked about building a skyscraper floor by floor. If I want to go that route, I can. What are the advantages? I mean, I get the advantages of optionality. You can pick and choose, but are there clear monetizable benefits for the customer of going that vertical integration route?
Sid Sheth
>> Yeah. So Dave, I think that's a very good question. Just one clarification. When we talk about the vertical integration, we are talking about vertically integrating silicon. So memory and compute, right? We talked about memory and compute integration, and the vertical solution that we have created is around putting more memory on top of compute and building that skyscraper so that we can always increase the capacity of memory that we add to that compute. The optionality that I was referring to is really about once you already have these cards. So up to the card level, we build all the silicon, we do the vertical integration, we build the software, we do all of that. That's d-Matrix. We take that card. There is a lot of innovation that goes onto that card, and that's where all the in-memory compute and the 3D stacking and all that stuff happens. But once we take that card to market, that card can plug into any server from multiple different vendors. That's where the optionality starts for the customer. So the customer is still telling us, "Hey Sid, we love what d-Matrix is doing. You have a lot of fantastic innovation on your accelerator cards. We want all of that. But then once we take that card, give us the optionality to plug that card into any server of our liking because we have our chosen server partners or we have our chosen server rack partners, or this is the way we connect up all the servers in our data centers. So that's the part where we would like to have the flexibility to decide how we put all of that together, but you put that card together."
Dave Vellante
>> Got it. Makes total sense. Sid, thank you so much. I mean, look, training is building these big mega data centers and that's been powering the AI wave for the past several years. Like you said, training is not going to go away. Inference, talk about stack, we're going to stack that on top, and reasoning, and it's just going to grow the TAM. I really appreciate your time, Sid, and always a pleasure having you on. Hope we can have you face to face here at the NYSE.
Sid Sheth
>> I would love to do that, Dave. I'd love to do that. So look forward to that and wonderful chatting with you.
Dave Vellante
>> Great, thank you. Okay, and thank you for watching. This is Dave Vellante for the NYSE Wired and theCUBE's AI Factories - Data Centers of the Future series. We'll be right back after this short break.