Clips from this session:
- DeepSeek's role in emphasizing the efficiency of smaller, distilled AI models
- d-Matrix's architecture focuses on co-locating compute and memory for efficiency
- Revolutionizing AI Inference: Corsair Accelerator Card and d-Matrix Transform Generative AI with Scalable Chiplet Architecture
- Seamless Integration: d-Matrix's Customer Deployment Strategy for Cost-Effective Inference Solutions in Diverse Data Center Operations
- Key applications targeted: reasoning models, interactive LLMs, and real-time video generation
Sid Sheth, chief executive officer of d-Matrix, joins theCUBE’s John Furrier and Dave Vellante during theCUBE + NYSE Wired: Robotics & AI Infrastructure Leaders 2025 event to unpack the role of in-memory computing in powering efficient AI inference. The conversation centers on Corsair, d-Matrix’s flagship chip architecture designed for low-latency, high-efficiency workloads.
Sheth shares how combining compute and memory dramatically reduces energy consumption and improves throughput across inference-heavy tasks such as interactive LLMs and video generation.
>> Welcome back to theCUBE's live coverage here in the studio. I'm John Furrier with Dave Vellante. We've got a great lineup, three days of wall-to-wall coverage of Robotics and AI Leaders, where we're talking about the chips, the software, and of course the energy, power consumption, and the data layer. We've had entrepreneurs, CEOs, and investors in here. Sid Sheth is here. He is the CEO of d-Matrix and a theCUBE alumni. Sid, great to see you. Great to see you last night at the event.
Sid Sheth
>> Yeah, it was wonderful to be there.>> You guys are in the middle of the action. The pressure is on, the market's waiting for more performance. Dave and I on our podcast call it The Red Zone, like in football: the infrastructure players are close to punching it in, and it's going to be a release of more horsepower, more energy. So congratulations, and great to see you again.
Sid Sheth
>> Yeah, thank you.>> First question is where are we? What's new? What have you shipped?
Sid Sheth
>> Yeah, I think we spoke, was it summer of last year?>> Yeah.
Sid Sheth
>> So it's been a year, and I think we spoke again at Supercompute, and that was the show at which we announced our first product, a product called Corsair. This is an accelerator card with inference silicon on it that we have built. It took us a few years to build out that silicon. We have done a lot of groundbreaking, world's-first type of stuff in that silicon, and we can talk about that. But the product is now shipping to customers, in pilots with some very large customers. Our plan is that we'll be in what we call general availability, or GA, by later this year and start ramping the product in 2026.>> Yeah, I remember the conversation at Supercomputing. We kind of teased it out. I think we even said a year ago, inference. We hit that up, and then we hit it hard at Supercomputing, and Supercomputing is coming up again fast. We're already getting... theCUBE's got a bigger booth, it's going to be bigger than ever. GTC at NVIDIA gets the market all hyped up. Everybody wants faster stuff. Inference in particular has gotten everyone's attention, and then the agent wave has kicked in, which is going to put more pressure on more tokens, more action. So with reasoning, and now that you've got evaluations, you start to see agents getting some visibility into where that goes, more data coming in. So this is really a demand curve that's hitting, so you guys are on the right wave.
Sid Sheth
>> Yeah.>> What's changed since Supercomputing on the inference side? Is there more awareness? What would you say would be the market factor or product-specific feature you're vectoring into on the inference side?
Sid Sheth
>> It's a great question. So I think we've been saying since before 2025 that the world is bifurcating. For the last 10 years, a lot of the conversation has been all about training, training bigger models, and it has actually worked pretty well. The bigger the model gets, the more capable it becomes. So everyone was on that trajectory: "Hey, let's just train bigger and bigger models. That means buy more GPUs and build bigger clusters." And somewhere along the way, I think it became pretty clear that this is not really an affordable trajectory for a lot of people. Yes, you can deploy all that CapEx, but where is the return? Maybe for a few companies who are in pursuit of AGI, that makes a lot of sense. But what about the rest? Not everyone is in pursuit of AGI. A lot of folks want to figure out what they're going to do with all these AI models that are becoming available with different degrees of capability. So the world started moving in two directions. There was one set of guys, very few, probably less than five, who are down the path of getting to AGI. Those folks will build big clusters, buy a lot of GPUs, and build big frontier models. But then there's the other segment, the vast majority, which is really focused on deploying models that are cost-efficient, models that are capable enough to get what they need to get done. And this is something we've been saying for a while, that there is going to be a very large segment of the market that will want that. That happened. The big thing that happened after Supercompute was DeepSeek. That happened in January, and it really put that conversation at the forefront: you don't really need to spend a lot of money to build a capable model. You can take big models and distill them down to smaller models, and those models are perfectly capable of doing a lot of different tasks. I think that has kickstarted a whole movement where people are saying, "We are going to use that class of models to build applications with and deploy them in the vast majority of enterprise applications that don't really need those massive frontier models.">> And also, a couple of things I want to ask you, because, one, we know that the demand for tokens is forcing a reset on architecture around bigger clusters. I would put them in the category of mega clusters for training, with maybe big workloads, and then large-scale clusters for normal data centers. This is another trend we're seeing. So now you're starting to see the use cases. In some cases you don't need a two nanometer, you can do it with a six nanometer; we're hearing that here. But also, DeepSeek points out that software innovation can deal with the constraints of the market. What are some of those constraints, and maybe that's a bad word, what's the enablement you're giving software people to work around them? Because, constraints simply put, it takes time to ramp up the next gen, so I've got to work with what I've got and then take advantage of the next generation, but the software is where the action is
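Sheth's distillation point is worth unpacking. Below is a minimal sketch of the standard knowledge distillation objective (Hinton-style, with temperature softening), where a small student model is trained to match a large teacher's output distribution. The logits and temperature are hypothetical values for illustration, not any specific model's:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()               # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)  # soft targets from the big model
    q = softmax(student_logits, temperature)  # small model's predictions
    return float(np.sum(p * (np.log(p) - np.log(q)))) * temperature ** 2

# Hypothetical logits over a 3-token vocabulary.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```

Minimizing this loss over a training corpus transfers the teacher's full output distribution, not just its top answer, into a model small enough to serve cheaply, which is the economics DeepSeek put in the spotlight.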
Sid Sheth
>> Correct.>> And so with good software and good energy management practices, you can have a data center serving up massive inference.
Sid Sheth
>> Right, right.>> Your reaction to that?
Sid Sheth
>> So, first of all, we've got to break apart the training problem and the inference problem. Even within training, pre-training, as you said, takes mega clusters, large clusters: you need lots of compute, many different racks of compute, to actually get a training task completed, depending on how large the model is that you're trying to train. When it comes to inference, everything that we are seeing, these clusters are not nearly as large. We are looking at one rack, two racks, maybe eight racks at best. The vast majority of inference tasks are really run on much, much smaller clusters, which obviously scales down the software problem. The other thing is that inference is inherently easier from a software perspective compared to, say, a pre-training task, where you are exchanging a lot of information between various GPU nodes and keeping everybody in sync. There's a lot of orchestration that needs to happen, lots of networking that you need to keep working in sync to get that task done. When it comes to inference, you don't do any of that. It's really shoving data through the compute, and all you're doing is quickly making decisions. It's a throughput play and a latency play, and you want to do all of that very quickly because users are trying to use this stuff. So the software problem is different for inference; it's not nearly as complex as it is for training. Now, what we have done at d-Matrix to make it more attractive for users to do more with our hardware is all around efficiency, because inference is about efficiency. I like to say this: training is a performance problem, inference is an efficiency problem, and you really can't take a training solution, put it into an inferencing solution, and hope that you'll get the same efficiency. You've got to build something from the ground up to be really, really efficient. That's what we have done, and the software has also been built from the ground up for inference. So things like using a different kind of numerical format that is very inference-friendly, we call those block floating point numerics, and those have now become a standard. Microsoft went out and standardized on them as the MX formats. The whole industry has formed a consortium behind it. Pretty much every silicon that is trying to do inference uses that, and it allows you to extract more efficiency from the workloads and just do inference more efficiently.
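For context on the block floating point idea Sheth mentions: a block of values shares a single power-of-two scale, so each element needs only a small integer mantissa, which is far cheaper to multiply in silicon than full floating point. Here is a minimal illustrative sketch; the block size and mantissa width are assumed values, not d-Matrix's or the MX consortium's exact parameters:

```python
import numpy as np

def bfp_quantize(values, block_size=16, mantissa_bits=8):
    """Block floating point: each block of values shares one power-of-two
    scale, and elements are stored as small signed-integer mantissas."""
    out = np.empty_like(values, dtype=np.float64)
    qmax = 2 ** (mantissa_bits - 1) - 1          # e.g. 127 for 8-bit mantissas
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_mag = np.max(np.abs(block))
        # Shared exponent: smallest power of two whose scale covers the block.
        exp = int(np.ceil(np.log2(max_mag / qmax))) if max_mag > 0 else 0
        scale = 2.0 ** exp
        # Each element becomes an integer mantissa times the shared scale.
        mantissas = np.clip(np.round(block / scale), -qmax, qmax)
        out[start:start + block_size] = mantissas * scale
    return out

x = np.random.randn(64)
xq = bfp_quantize(x)
print("max abs quantization error:", np.max(np.abs(x - xq)))
```

The MX specification standardizes concrete variants of this scheme (shared scales over small blocks with narrow element types); the sketch only shows the core mechanism of trading per-element exponents for one shared scale.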
Dave Vellante
>> Can you explain the architecture a bit more? Because we think of monolithic chips, big-ass RAMs, and then more efficient chiplet architectures. I think you're the latter, although you do use large SRAMs, and my understanding is blocks of SRAMs.
Sid Sheth
>> Yeah.
Dave Vellante
>> Can you explain the architecture and the difference from sort of what we consider a monolith? I can think of NVIDIA as big chips. Explain the difference.
Sid Sheth
>> Yeah. So the way we like to put it is, there is the von Neumann world, the von Neumann architecture. In a traditional von Neumann architecture you have compute, you have memory, and you spend a lot of time going between compute and memory; data has to transfer, and you spend time and energy. The core innovation at d-Matrix was asking: how would we just co-locate the compute and memory? So the vast majority of our silicon is these in-memory computing engines, which are essentially memory where you can keep the model parameters, but the computation is happening in the memory. You are saving all that energy and time going back and forth to that first tier of memory. You don't need to do that anymore, and that gives a lot of efficiency in terms of energy and latency. So I think that's one of the core innovations at d-Matrix.
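The scale of that saving is easiest to see with a back-of-envelope model. In a weight-heavy workload like LLM inference, the energy to move a weight from off-chip memory can dwarf the energy of the multiply-accumulate that uses it. All constants in this sketch are rough illustrative assumptions, not d-Matrix figures:

```python
# Back-of-envelope energy for one matrix-vector multiply, the core
# operation of LLM inference. All constants are illustrative assumptions.

MAC_ENERGY_PJ = 0.5            # energy per multiply-accumulate, picojoules (assumed)
DRAM_FETCH_PJ_PER_BYTE = 20.0  # cost to fetch a weight from off-chip memory (assumed)
LOCAL_PJ_PER_BYTE = 0.2        # weight already resident next to the compute (assumed)

def matvec_energy_uj(rows, cols, bytes_per_weight, fetch_pj_per_byte):
    macs = rows * cols
    weight_bytes = macs * bytes_per_weight
    total_pj = macs * MAC_ENERGY_PJ + weight_bytes * fetch_pj_per_byte
    return total_pj / 1e6  # picojoules -> microjoules

rows, cols = 4096, 4096  # one transformer-sized weight matrix
von_neumann = matvec_energy_uj(rows, cols, 1, DRAM_FETCH_PJ_PER_BYTE)
in_memory = matvec_energy_uj(rows, cols, 1, LOCAL_PJ_PER_BYTE)
print(f"weights fetched from DRAM : {von_neumann:6.1f} uJ")
print(f"weights resident in-engine: {in_memory:6.1f} uJ ({von_neumann / in_memory:.0f}x less)")
```

Under these assumptions, keeping weights resident next to the compute cuts energy per operation by well over an order of magnitude, which is the intuition behind the efficiency claim.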
Dave Vellante
>> And the Corsair platform, what are the salient aspects of it?
Sid Sheth
>> The salient aspects of Corsair are, first of all, it's a true generative AI inference platform built for that application. This is not something that was built for something else, where we then said, "Oh, inference is the new hot market. Let's go pivot this thing into that application." No, the company was started in 2019 to build a solution for inference, and Corsair is a product that has been in the making for almost three to four years with an emphasis on generative AI inference. So, things like building it with chiplets as opposed to building it with very large chips. And the reason we did that was we said, "Inference is not a one-size-fits-all problem. People want to do inference on big workloads, small workloads, big data centers, small data centers. You can't really build one solution and hope it's going to work everywhere."
Build chiplets, and you can scale it up and scale it down. Use things like in-memory compute, which is now becoming very popular for inference, as an inference-dedicated compute engine. Use things like, as we touched upon, block floating point numerics, numerics for inference. All of this is built into Corsair: connecting these chiplets in an all-to-all configuration, putting it all on very inexpensive packaging, building a 600-watt PCIe card with a dense collection of chiplets that has a lot of compute and memory. It's a recipe, I like to call it a recipe, where a number of ingredients have been carefully put together to bake a really good cake, and you really have to build this from the ground up. You can't just think of it as an afterthought.>> It's got to be near the compute too; the compute-hungry aspect of inference is another factor.
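A quick sketch of what "all-to-all" implies as the chiplet count scales: direct links grow quadratically with chiplet count, which is workable at card scale but not at cluster scale, one reason the approach fits a scale-up/scale-down card design. The per-link bandwidth here is a hypothetical figure, not a Corsair specification:

```python
# All-to-all chiplet connectivity: every pair of chiplets gets a direct link.
# Per-link bandwidth below is a hypothetical figure, not a Corsair spec.

PER_LINK_GBPS = 100  # assumed die-to-die link bandwidth

for n in (2, 4, 8, 16):
    links = n * (n - 1) // 2        # one direct link per chiplet pair
    half = n // 2
    bisection = half * (n - half)   # links crossing a midline cut of the card
    print(f"{n:2d} chiplets: {links:3d} links, "
          f"bisection bandwidth ~{bisection * PER_LINK_GBPS / 1000:.1f} Tbps")
```

All-to-all avoids the multi-hop routing a mesh would need, keeping chiplet-to-chiplet latency flat, which matters for the latency-sensitive inference workloads Sheth describes.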
Sid Sheth
>> Correct.>> How are people deploying? How do you see the use cases unfolding? Can you share your thoughts on what this will look like in a customer's environment? How are they looking at deploying this? Are they changing their rack configurations? Obviously, there's the card. What are some of the specifications?
Sid Sheth
>> So the beauty of the d-Matrix solution is that we do not expect our customers to change anything. We go in with an accelerator card, and the way we build our accelerator cards, they can plug into a server built by any of a number of AI server folks out there, whether it's the Supermicros, Dells, HPs, Lenovos. There's a whole bunch of these folks building AI servers, so we don't want to reinvent that part of the wheel. We are saying, "Okay, that's been invented. Let's find a way to augment the infrastructure that's already out there with our solution, so that you can do generative AI inference and do it in a way where you completely change the cost economics of doing it and completely change the user experience that you can give your users."
That is a big part of the value prop of what d-Matrix does. We never wanted to go in there with a full solution and expect our customers to say, "Oh my God, I have to rip and replace the stuff I have to accommodate you." No, that's not what we want to do. We want to find a way to live in our customer's environment and augment it with what we have built. That obviously broadens the footprint of how many customers we can serve, because there are lots of folks with existing infrastructure that they don't want to throw away, and folks building new infrastructure that looks a lot like the old infrastructure, or at least is being upgraded. We fit into any of those. We can plug into greenfield infrastructure, we can plug into brownfield infrastructure, and that really opens up the market for us.
Dave Vellante
>> What are you seeing as the use case patterns that are emerging?
Sid Sheth
>> That's a great question. So, as a startup, we really don't want to go off and boil the ocean. We can't. We have to focus on where we are really good, so we have to pick a spot and be really good at that. In our case, we have picked three applications that we really want to go after. We are very excited about agentic and reasoning models, because reasoning models require a lot of compute, but they require extremely low-latency compute: essentially you are specifying a task as a high-level construct, you want the agents to understand all of that, and they use the reasoning models behind the scenes to break down the task and collaborate with other agents. So lots of chains of thought, lots of compute, but you want the turnaround time to be very quick. If you give that same task to a group of humans and they take three days, you want AI to wrap it up in hopefully three minutes, maybe three hours, but no more. So you need very low-latency turnaround and lots of compute. That's a great application for d-Matrix, because we do that well. The second one is anything that's a highly interactive LLM, so interactive coding agents are a great example. Advanced coding agents, where you have very sophisticated programmers working with AI to program something, doing a lot of advanced coding in an extremely interactive way. So that's application number two. And application number three, which we are very excited about, is video generation, interactive video generation. You have video generation today, but it's really an offline experience. You prompt a model, you go away for some time, you come back, a video is there, you don't like it, you re-prompt it, go away. That will not work longer term. What if you had a situation where you could prompt a model in real time? Say, "Show Dave in front of the Eiffel Tower," and, by the way, the video is getting created as you're prompting. And then, "But I don't like the clear blue sky. Maybe throw in some clouds." All of this is happening live; it's almost like how you converse with a text model today, the same thing you could do with video. That is not possible today. And it would open up a lot of different use cases: Hollywood could make movies very quickly, influencers could make videos very quickly. So I think that's another thing that we want to try and unlock with our-
Dave Vellante
>> And it's your architecture, the in-memory compute and real-time high performance, that allows you to do that?
Sid Sheth
>> Yeah, that's right. Basically, we have created an architecture where, again, going back to that compute and memory integration we have done, we co-locate compute and memory. Traditional architectures, as I was saying, are kind of planar: compute and memory sit side by side. In our case, we co-locate and go high, so we are essentially going 3D. These approaches are very new. We pioneered them, and they unlock a lot of memory bandwidth, which is really the choke point for a lot of these interactive applications.>> Memory is huge. Where do you see the growth coming from? As CEO, you've got the product roadmap, you're ramping up. What's the focus of the growth strategy? What's the plan?
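Why memory bandwidth is "the choke point": in autoregressive generation, each new token requires streaming roughly the whole model's weights through the compute once, so the per-user token rate is capped by bandwidth divided by model size. A rough sketch, with illustrative numbers rather than any vendor's specifications:

```python
# Rough ceiling on interactive LLM speed: each generated token streams
# (approximately) all weights once, so tokens/sec <= bandwidth / model bytes.
# Numbers are illustrative, not any vendor's specs.

def token_rate_ceiling(params_billions, bytes_per_param, bandwidth_tb_per_s):
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / model_bytes

for bw in (1.0, 4.0, 10.0):  # memory bandwidth in TB/s (assumed)
    rate = token_rate_ceiling(params_billions=8, bytes_per_param=1,
                              bandwidth_tb_per_s=bw)
    print(f"8B-param model @ 1 byte/param, {bw:4.1f} TB/s "
          f"-> ~{rate:5.0f} tokens/sec ceiling")
```

Quantized weights (fewer bytes per parameter) and more bandwidth both raise the ceiling, which is why an inference-first design pushes on numerics and memory placement at the same time.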
Sid Sheth
>> So, again, going back to the three that we are focusing on, interactive LLMs, reasoning agents, and video generation, interactive video generation, it's going to come in different waves. I think wave number one is already here. I think the interactive LLM wave is here. Folks want to do more with LLMs, but they want to do it really, really fast, ultra fast, and that's where we can really help. So I think that's going to be the first wave. To me, the second wave is going to be reasoning and agents, and that's coming in probably another year, year or two. And then after that, you'll see the video generation wave that'll come right up.>> And who's buying the product? Who's the target customer?
Sid Sheth
>> So we typically work with data center operators, all classes of data center operators. We work with the big hyperscalers, we work with what they call the neoclouds, we work with the sovereign clouds, and we also work with the enterprise clouds, which are on-prem deployments. So it's four buckets of customers that we go after, but yeah, anybody who has an application->> You're saying inference is becoming the decision point, the logic gate, for lack of a better word, because now, take sovereign cloud for instance: you can run an inference server in a country to make the calls around how to handle things, or agents. Is that right? Am I getting that right? Or how would you describe the value here? Because it's infrastructure, but you're enabling this new inference engine.
Sid Sheth
>> Correct.>> Because inference is decisions you infer off your training.
Sid Sheth
>> That's right. That's right. And decisions based on certain types of data. So we are going multimodal now. The world so far has been primarily text-based: we talk a lot about LLMs and language, and that's all text, but that's not where we are going over the next five years. Over the next five years, it is not just text anymore. It is video and audio and images and everything. So now that things are going multimodal, and models are becoming natively multimodal, trained on multimodal data, you're making decisions on different kinds of content. And you are not only making decisions anymore, you are also generating new content, content the model has never seen before, based on the content the model was trained on.>> So I have to ask you about the roadmap, because I know the cycles are different. Obviously, you're making a great product, you're integrating the secret sauce in there with memory and compute, and you have to have visibility into multiple generations. And again, the constraints will be out there and the software will get around them. What does the roadmap look like for you guys? Do you share that? Do you talk about what's next? What's your focus? Is it smaller devices, smaller chips?
Sid Sheth
>> Yeah.>> What's the focus on the roadmap?
Sid Sheth
>> So our roadmap is, again, predicated on that simple idea of putting memory and compute together. We like to say we are building a skyscraper. Our first-generation product that is out in the market, Corsair, is the ground floor. So we have laid the foundation, and the ground floor has been put in place. It is a planar product: we have in-memory computing engines, built out as a collection of chiplets on a board. So we have laid the foundation and put multiple buildings in place. From there, our roadmap is pretty simple. We are going to start stacking memory on top of that compute, and we are just going to keep going higher. The next product is where we put in the first floor: we'll be taking memory and stacking it directly on top of compute. So we have this in-memory compute engine, and now we have memory sitting on top. And why are we doing that? To add more capacity to the platform while retaining our memory bandwidth advantage, because by stacking memory directly on top of compute, you're still holding onto that bandwidth advantage while adding more capacity. The products after that will keep adding more floors to the skyscraper. It's fundamentally a different way of building an inference computing solution.
Dave Vellante
>> What is the engineering trade-off that you make when you go with that approach? I mean, you're always making decisions: you go into engineering, you come down a path, and you say, "Okay, we've got to go this way or that way." So when you decided on that architecture versus whatever the alternative was... Clearly, you weren't going to go monolithic, but what's the trade-off there? The challenge?
Sid Sheth
>> So I mean, the trade-off, primarily, when you start doing something different, is that the ecosystem has to mature around the product you're trying to put together. So we were very, very careful about making sure that the technology choices we made were mature enough that we could build a product with that approach. We've been spending a lot of time with our ecosystem suppliers and partners, making sure this has been completely de-risked. For the past two years, the company has been systematically de-risking this. And it's not that we are taking on all of the de-risking responsibility; some of this has already been addressed in the past by other applications like crypto mining. But that is one of the big things. The other thing is there was a deliberate reason we made those choices. We wanted to stay away from the HBM ecosystem, because the HBM ecosystem is heavily dominated by one big GPU vendor, and we would never get access to the best-in-class HBM memory. So we wanted to find a solution that's got capacity and high memory bandwidth but is not HBM.
Dave Vellante
>> Right, so you have your own supply chain.
Sid Sheth
>> Correct. Exactly. At least we're not dealing with supply constraints. It's supply creation, but not supply constraints.
Dave Vellante
>> That's a type A error.
Sid Sheth
>> Yeah.>> Well, Sid, it's great to have you on. Congratulations on the ramp-up.
Sid Sheth
>> Thank you.>> You're bringing inference to the average data center, which wants more infrastructure, so you're actually going to increase the value versus the mega, mega data centers.
Sid Sheth
>> That's the hope and that's the plan, yes.>> All right. Check out d-Matrix. Again, the innovation is coming at the infrastructure layer. The pressure is on, because the demand for apps and business logic and agents is waiting, and d-Matrix, among others, is sharing their opinions here on theCUBE's Robotics and AI Leaders program. Dave Vellante, I'm John Furrier. Thanks for watching.