theCUBE + NYSE Wired: AI Factories - Data Centers of the Future | Preet Virk, Celestial

Clips

Preet Virk

Co-Founder, COO

Celestial

Growth of AI models necessitating larger GPU clusters for data processing.

AI factories as transformative architectures for data processing and efficiency.

Role of memory cache efficiency in enhancing AI model performance and ROI.

Model Flop Utilization (MFU) as a key metric for optimizing compute efficiency.


For this theCUBE + NYSE Wired conversation from AI Factories – Data Centers of the Future, Celestial AI co-founder and chief operations officer Preet Virk joins theCUBE’s John Furrier at the New York Stock Exchange to unpack why interconnects – not just compute – now define AI-scale performance. Virk outlines the macro shift to GPU-dense clusters (single “compute units” measured in thousands of GPUs, with data centers growing from ~100K to 200K GPUs) and details why copper links top out at ~2–4 meters while photonics becomes mandatory beyond that. He quantifi...


>> Hello, I'm John Furrier here at the New York Stock Exchange, part of theCUBE Studios on the East Coast. Of course, we have theCUBE Studio on the West Coast in Palo Alto, connecting Wall Street and Silicon Valley as part of our theCUBE and NYSE Wired program and community. This is our AI Factory series, the Future of the Data Center, the Future of Computing, and we've been seeing a large-scale surge of systems being re-architected, redefined, and also built on top of existing systems, where the chips and the technology are enabling new software stacks and new capabilities. It's also Climate Week here in New York, so of course climate tech, med tech and health tech are all empowered by large-scale systems. Preet Virk is here, co-founder and COO of Celestial AI. They're leading the way in efficient power systems, connecting systems together. Preet, thanks for coming on, and congratulations on the recent closing of your financing round. You guys have had over half a billion dollars in funding to go to market and expand out the products. Thanks for coming on our AI Factory series.

Preet Virk

>> Thank you for having us on. Thank you.

>> So you guys have had extremely big success. Obviously the world of interconnects, these large-scale systems. We're seeing evidence out there: NVIDIA, the Open Compute Project, people are building systems, the large hyperscalers, and even enterprises are looking at how they're going to re-architect their systems, what was called servers, storage and networking in the old days, into large-scale systems that are highly efficient for generative AI. This is an area you guys are in the middle of, and again, the tailwind and the wave you're on is enabling this. So give us the business update, please, because there's been a surge of general interest in the mainstream around all these components. Photonics is now hot, it's the hottest area. AI infrastructure is by far the hottest area in technology right now.

Preet Virk

>> Yeah. So let's look at the macro trend and then I'll fit in where we come in. I think everybody realizes now that the AI models have grown at a pace that nobody anticipated. And as the AI models get larger and larger, you need more and more XPUs and GPUs to compute and process that model, which translates simply into massive clusters that three or four years ago we would never have imagined. Data centers with a hundred thousand GPUs, going to 200,000 GPUs, are becoming common. The single compute unit is now measured in thousands of GPUs, not as a single GPU or a server blade. What that means is that all these GPUs need to be connected so that they work almost as if they were a single unit. And when I use the word connected, that's where the interconnect comes in. The large cluster size also results in a reach issue. You can't pack these GPUs as close as you would like because of thermal and heat considerations, and therefore reach is starting to become an issue, which is where photonics comes in. Copper without re-timers or re-clockers is good enough for 2, 3, 4 meters. Beyond that, optics clearly is the right answer. So photonically connected, extremely large clusters that have very high bandwidth, very low latency and very low power, as measured in picojoules per bit, are what these modern data centers really need. Now when you look at these data centers and look at the traffic patterns, the scale-up network is where 85% of the data movement is happening. 85%. What that translates into in power is that 60% of the data center energy use is in data movement. Not compute, data movement. And that 60% has a side effect, which is the model flop utilization. The compute efficiency is highly, highly dependent on the efficiency of your data movement.
We focused on the data movement problem early on, but to be very honest, we never realized that this problem was going to become such a large component of a modern data center infrastructure. So what we focused on is the photonic fabric. What we focused on is the scale-up network from day one, not scale-out, and that's where, as they say, the pain is, and that's where the photonic fabric comes in. So what we allow, simply, is for the industry to build very large clusters in a very efficient fashion.

>> The scale up, scale out, and then just recently NVIDIA was really marketing the term scale across, which is across data centers. That speaks to the comment you made about all the systems acting as one, essentially a supercomputer. This is a whole concept.

Preet Virk

>> Yeah.

>> Explain that.

Preet Virk

>> Back in the day, you had a single-core compute die. Then we went to multi-core dies and started focusing on things like network-on-chip, so that the different cores on the die behave like one compute unit. We then started thinking of single rack-level solutions where the networking, the memory and the compute were all interconnected nicely, tightly and efficiently. But as I mentioned earlier, because the AI models are growing at such a torrid pace, we now need compute nodes, a node being many GPUs, that span multiple racks. Now, for certain training, for check-pointing efficiency, locality and so on, there's also a need, and this is what the scale-across term introduced by NVIDIA refers to, to actually make different data centers behave as a single compute node. So that's just a natural progression, where you have gone from a server, to a rack, to multiple racks, to a data center, to multiple data centers. If I may, let me introduce another term, which is scale in. Scale in is how many dies you would like to actually... You would want to include a lot of compute dies inside a single package, and these are full-reticle dies. At Celestial, we actually built an optical multi-chip interface bridge, OMIB, that allows you to optically connect these large dies that have high-density, high-speed IO in a small space within a package. So that's yet another term in the industry now, scale in. So the progression is scale in within a package, scale up, tying different GPUs, scale out, tying different clusters, and then scale across data centers.

>> I love that scale in term. I want to just double click on that if you don't mind. I mean scale out, everyone kind of understands the horizontal scaling. Cloud was great for that. Scale up is super important with scale across inside the racks because now the data centers are where the action is because of the enterprise and also the way they're designed is to leverage the performance, the latency for those workloads because the data's there. And so can you talk about the scale in, scale up dynamic, because I think they're related. And I'll throw one more thing in there if you don't mind, power, because the power envelope is bounding the architecture. So can you talk about the scale up and the scale in and how they're related or how should people think about that? Because scale up and scale in kind of seem related. Share your thoughts on that please.

Preet Virk

>> Sure. Actually, you triggered the power thought in my head. I talked about 85% of the traffic being in the scale-up network and 60% of the data center energy usage being data movement. A majority of the transactions that involve data movement are remote memory transactions. So in the state of the art today, when GPU 1 goes and tries to access GPU 256's memory, you burn close to 55 picojoules per bit, 53 picojoules per bit. With our photonic fabric, we reduce that number to 13 picojoules per bit. So now extrapolate those two numbers. 60% of energy is being used in data movement, the majority of those transactions are remote memory transactions, and we provide a solution that uses one fourth the power. And oh, by the way, in terms of latency, it also has at least a third better latency than the norm. So, going back to scale-in, ideally you want to pack these compute elements with their associated memory as densely as possible.
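
The extrapolation Virk invites here can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the 60%, 55 pJ/bit and 13 pJ/bit figures are the ones quoted above, while the function name and the `remote_share` parameter (the fraction of data-movement energy that actually benefits from the lower per-bit cost) are assumptions added for illustration, not numbers from the interview.

```python
def relative_energy(movement_share, old_pj_per_bit, new_pj_per_bit, remote_share=1.0):
    """Data center energy after the change, relative to a baseline of 1.0.

    movement_share: fraction of total energy spent on data movement (0.60 in the talk).
    remote_share:   ASSUMED fraction of that movement energy that benefits from the
                    lower pJ/bit figure; 1.0 is a best case, not a quoted number.
    """
    per_bit_ratio = new_pj_per_bit / old_pj_per_bit
    # Energy for the bits that benefit shrinks by per_bit_ratio; the rest is unchanged.
    movement_after = movement_share * (remote_share * per_bit_ratio + (1.0 - remote_share))
    # Compute and everything else (1 - movement_share) is left untouched.
    return (1.0 - movement_share) + movement_after


best_case = relative_energy(0.60, 55.0, 13.0)       # every moved bit benefits
half_case = relative_energy(0.60, 55.0, 13.0, 0.5)  # only half of them do

print(f"per-bit energy ratio:   {13.0 / 55.0:.2f}")
print(f"best case total energy: {best_case:.2f} of baseline")
print(f"half case total energy: {half_case:.2f} of baseline")
```

Under these assumptions the per-bit ratio is about 0.24 (the "one fourth" in the answer), and total energy lands at roughly 0.54 of baseline in the best case and 0.77 if only half of the movement energy benefits.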

>> Photonics...

Preet Virk

>> So the usual...

>> Go ahead. Continue.

Preet Virk

>> Yeah. The usual barriers to doing that are thermal and just space, simple physical space. When you start packaging more and more dies inside a package, and let's say you solve the thermal problem with liquid cooling or some other mechanism, one of the choices that we see our customers having to make is the loss of the beachfront, the shoreline. Because ideally you want to attach as many HBMs to the die as possible. When you package, let's say, four dies in a package, you now need to allocate beachfront to tie these four dies in an all-to-all connected configuration. Where OMIB comes in is we actually give the beachfront or the shoreline back to the customer such that they can either attach HBMs to it or use it for some other purpose, because we sit right behind on die and provide a different plane, a different dimension for connecting these dies, which is the photonic waveguide. So the data moves inside a photonic waveguide, not on the electrical plane. Very high-speed, very dense connectivity inside these packages also brings very gnarly, very complicated signal integrity issues that you have to deal with. So not only do we give the physical space back, but we actually eliminate the need to deal with those issues, because photons don't have EMI issues.

>> And a great power enhancement there. Everyone's optimizing for power. I've got to ask you about AI factories, this concept. I mean, go back 16 years ago when we started theCUBE, the big data movement was hot, Hadoop. I've heard references like, "Data's the new oil. These are refineries and we've got to process the data." Okay. AI factories, I have a nice visual in my head of this place where data's getting processed at large scale. Talk about what that means to people in business, because I think the fundamental sea change is that they're thinking about their compute differently. Like you said, a data center acting like one system. This is a system architecture. I mean, this is not a fashionable trend that may go away. This is a complete changeover from the system economics and the system architecture of the nineties. I mean, we chugged along from the nineties and things got better and better, networking had intelligence, but now it's almost as if it's a whole nother transformation. Could you speak to this concept? Because there are the benefits of getting it right, from power management, like you laid out, to total cost of ownership. Apps are going to run on this. Data movement, scale up and even scale out too. Cloud's not going away, and cloud scale has shown us that it's good, but it's not always the answer because data is on premises. So these data centers are turning into kind of mini clouds, if you will. It's distributed computing. Describe this wave for the people who don't know the history. AI factories, what do they represent in your mind? Because we see it as a complete changeover. What's your reaction?

Preet Virk

>> Yeah, so first of all, to your point, I think the industry's done a phenomenally good job of improving the compute and the compute efficiency per unit of power or space. And if I go back to the 1980s, and I was still in college back then, there was a beautiful slogan coined by, I believe, Sun Microsystems, that the network is the computer, and then it kind of went away. In the AI data center space, call it the network or the interconnect, the interconnect is sort of the neurons that tie it all together. An inefficient interconnect will lead you to an inefficient data center. Or say it the other way around: if you want to treat the data center as one supercomputer, then the efficiency of that supercomputer is directly proportional to the efficiency of your interconnect. What we have shown is that with a higher-efficiency interconnect, and when I say efficiency, I mean power and latency, you can actually get two to three times the compute out of the same power envelope that your current data center sits in. That is massive. Now going back to the factory. I love the word factory because the moment you say factory, the question is what is the factory producing? What's the output? I think in the last year or so, we have converged on an answer: it serves tokens. The whole point of this is how efficiently can I serve tokens? So now going back to your data and factory analogy, let me break things down into at least three buckets. Obviously there are more. There's training, and in training you obviously need the largest cluster you can build for better training. Within training, again, the largest data movement is the communication collectives. These communication collectives, when shared across a very, very large GPU cluster, create a tremendous amount of traffic over the interconnect because each GPU needs to have the latest and greatest copy of whatever was most recently processed.
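
To give a feel for why collectives generate so much interconnect traffic, here is a minimal Python sketch that uses ring all-reduce as a stand-in. Ring all-reduce is one common collective algorithm; the interview does not say which collectives or algorithms are meant, so the choice is an assumption, as are the function name and the 10 GB payload. In a ring all-reduce each GPU sends 2·(N−1)/N times the payload size, so per-GPU traffic stays close to twice the payload while aggregate traffic grows with the cluster.

```python
def ring_allreduce_bytes_sent_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends in one ring all-reduce of `payload_bytes`.

    Reduce-scatter and all-gather each take (N - 1) steps, and every step
    moves payload/N bytes, giving 2 * (N - 1) / N * payload per GPU.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 2.0 * (num_gpus - 1) / num_gpus * payload_bytes


# Hypothetical payload: 10 GB of gradients synchronized once per training step.
payload = 10 * 10**9

for n in (8, 1024, 100_000):
    per_gpu = ring_allreduce_bytes_sent_per_gpu(payload, n)
    total = per_gpu * n  # aggregate bytes crossing the interconnect per all-reduce
    print(f"{n:>7} GPUs: {per_gpu / 1e9:6.2f} GB per GPU, {total / 1e12:10.2f} TB in total")
```

The cluster sizes in the loop echo the scale discussed in the interview (up to a hundred thousand GPUs) but are otherwise arbitrary.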
At Celestial, we have actually built a memory ASIC, a disaggregated memory ASIC, that goes into what we call a photonic fabric appliance. That appliance gives you 32 terabytes of unified memory space. Today on a GPU, you have 200 GB per GPU; with multiple GPUs tied together, you get a very distributed architecture. What we can provide is a unified architecture so that in training, the communication collectives can be dealt with more efficiently. But then again, because training is so data-movement heavy, the lower power and the lower latency come into play as well. The second category is inferencing. This is where you actually make money. Training is an expense. Inferencing is where the I of the ROI comes in. Sorry, the R, the return of the ROI comes in. Training is the investment. So your ability to monetize your business better than the competitor depends on whether you can serve a token more efficiently when inferencing, making more money per minute per GPU. And this is where the KV cache efficiency comes in. The context storage, whether it's an MoE, mixture of experts, or reasoning models... Studies have shown that a larger, more efficient KV cache gives you higher revenue per second, per minute. In other words, it can serve tokens more efficiently per second. Internally, our engineers actually use an AI agent, and what they notice is that if they start a session in the morning, prompt after prompt after prompt when they're solving a problem, at some point the model forgets the entire session and kind of restarts, which is very frustrating for the engineers. And we believe what's happening is the KV cache is getting exhausted. So again, using the Photonic Fabric Appliance, you can provide a very, very large 32 terabytes of KV cache, which then improves your efficiency of serving tokens and improves the quality of responses because you're not resetting the KV cache as often.
And last but not least, going back to training, check-pointing is a very, very big deal. A low-latency, photonically connected check-pointing memory system is extremely helpful for not losing your work, and you can do check-pointing more frequently and more efficiently. Lastly, there's an interesting enterprise angle, going back to your data-is-the-oil point. A lot of enterprises are very wary of sharing their proprietary domain-specific data and teaching an LLM their proprietary knowledge. So this thing called RAG has become the way to deal with that, where the enterprise keeps its data in its own internal storage or RAG database and then only brings it out when the prompt necessitates it. So across training, across inference, and across the new wave, which I believe is the enterprise segment, interconnect, data and storage are critical.
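
The KV cache capacity point in this answer can be made concrete with a rough sizing sketch. The 200 GB and 32 TB capacities are the figures quoted above (read here as decimal units); everything else is assumed for illustration: the model shape (80 layers, 8 key/value heads, head dimension 128, 2-byte values) is hypothetical, and the calculation ignores the model weights and activations that also occupy GPU memory.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(capacity_bytes, bytes_per_token):
    """Whole tokens of context that fit in a given cache capacity."""
    return capacity_bytes // bytes_per_token


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=2)

gpu_capacity = 200 * 10**9        # 200 GB per GPU, as quoted in the interview
appliance_capacity = 32 * 10**12  # 32 TB appliance, as quoted in the interview

print(f"KV cache per token: {per_token} bytes")
print(f"tokens in 200 GB:   {tokens_that_fit(gpu_capacity, per_token):,}")
print(f"tokens in 32 TB:    {tokens_that_fit(appliance_capacity, per_token):,}")
```

Whatever the model shape, the ratio between the two capacities is 160x, which is the headroom behind the claim that a larger cache means fewer context resets.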

>> You know, Preet, I love that description. It's a master class right there. I'm definitely going to grab that as a highlight. Memory is a double-sided coin. There's actual memory, like high bandwidth memory, computer memory, and then there's the memory of the session on the GPU, which, as you pointed out, loses track. That's a cool point, and I want to quickly talk to you about the implications of it. And two, you mentioned the enterprise. The enterprises, they know scale-up, they understand scale-up. You mentioned that. When you have a scale-up architecture that's not designed efficiently, you lose packets, you lose sessions, the GPUs are working harder. So you need to have very, very low latency and limited hops. And I'm getting kind of in the weeds here. I want to make a point because the data is there. That's their intellectual property. Data movement, of course, that's their intellectual property moving around. The scale-up architecture is super important because if you don't do it right, it blows the whole system. The purpose of the GPUs is to have memory of the session and to reason and to continue to serve the query or the prompt. And then also having the low latency managed through memory. Talk about that scale-up and GPU relationship, because nobody wants idle GPUs restarting, because there's cost there too, and power. Am I getting that right or... I mean it's very in the weeds. I want to get your thoughts on this.

Preet Virk

>> Absolutely, and actually, just to put a little different spin on what you just said. One thing we don't talk a lot about is MFU, Model Flop Utilization. As I commented earlier, the industry has done a phenomenal job packing very efficient compute into these processors, but our checks in the industry have indicated that for some of the recently launched large AI models, Llama being one of them, the best-case Model Flop Utilization we are seeing is about 40%. That means your very, very, very precious asset is not producing 60% of the time. Now, a hundred percent MFU is all theoretical. It's not practical. But what if you could improve it another 20%, right? What if you could take it to 60%? Usually those stalls are happening because of a lack of ingress or egress data pipe availability, which is a bandwidth issue as well as a latency issue. So the distributed memory, meaning memory attached to a GPU times N, where N is a large number in the thousands, gives you a large aggregate memory, but it is fragmented. And for an enterprise, improving the MFU is a really, really, really big deal because that is a return on one of the highest CapEx items that you had invested in, whether it's on-prem or whether it's compute you bought in the cloud. So I think as the industry understands the bottleneck of data movement and how it's getting in the way of actually improving compute efficiency, you will see more and more adoption of solutions such as ours or others, but things that are photonically connected with very high bandwidth, very fat pipes such that congestion in the pipe is never an issue, very low latency such that the customer experience is never sacrificed because of latency, and obviously, last but not least by any means, the power, the picojoules per bit that you need to move this data around.
Let me just finish off by saying there's an old figure in the industry that's often been used. It's a triangle that shows the storage hierarchy. At the top obviously is SRAM. Nothing beats memory that's sitting right next to the compute. Nothing beats that, right? But then you can only put so much SRAM on a die. Next is memory attached to the die, whether it's DDR, HBM or whatever your flavor of memory is. What Celestial has done is introduce a slice right next to the DDR such that for a two hundred-ish nanosecond latency, you can get to very, very large capacities, 30 terabytes for example. Because the other recourse in today's state of the art is to go to SSDs, which now get into seconds or microseconds, and they are further away in terms of latency, the bandwidth is low and they burn a lot more power. So I think in the AI space, we need that new level of memory hierarchy that sits between directly attached memory and a further, far-out SSD pool.
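
The MFU figures in this answer can be put into a small Python sketch. It uses the common approximation that training a dense transformer costs about 6 floating-point operations per parameter per token; that approximation, the function names and all example inputs are assumptions for illustration, since the interview only quotes the 40% and 60% utilization levels.

```python
def model_flop_utilization(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Useful model FLOP/s divided by the cluster's theoretical peak FLOP/s.

    Uses the ~6 * parameters FLOPs-per-token rule of thumb for training
    (forward plus backward pass), which ignores attention-specific terms.
    """
    achieved_flops = 6.0 * num_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


def throughput_gain(mfu_before, mfu_after):
    """How much more useful work the same hardware does at the higher MFU."""
    return mfu_after / mfu_before


# Hypothetical single-accelerator example: 1B parameters, 1,000 tokens/s,
# and a peak of 15 TFLOP/s, chosen so the result matches the quoted 40%.
mfu = model_flop_utilization(num_params=1e9, tokens_per_second=1_000,
                             num_gpus=1, peak_flops_per_gpu=1.5e13)
print(f"MFU: {mfu:.0%}")

# Moving from the quoted 40% to the hoped-for 60%.
print(f"gain from 40% to 60% MFU: {throughput_gain(0.40, 0.60):.2f}x")
```

Read this way, going from 40% to 60% is a gain of 20 percentage points but 1.5x the useful compute from the same hardware, which is why it matters so much for the CapEx return described above.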

>> Well, Preet, congratulations on being the co-founder and chief operating officer of Celestial AI. You're in the right spot and you're doing amazing work to make these factories work. I love "the network is the computer" because at the last GTC, Jensen Huang, the founder of NVIDIA, said, and KV cache, he said, "This is the operating system of the AI factory." And I'm like, "Wait a minute, that's networking." So networking is the operating system. So this is a new reality, so appreciate you.

Preet Virk

>> Thank you. Thank you for the opportunity.

>> Well, thanks for coming on. I'm John Furrier with theCUBE. We are here at the AI Factory media event all week. We're going to be laying out all the top topics, the experts, the leaders who are making it happen, with Celestial AI building the key component for these systems that will act as one supercomputer. Whether it's in a big data center, a small one or at the edge, you will see the distributed computing architecture for AI develop super fast. They're factories, they're refineries, whatever you want to call them, and they'll help us with our apps. Thanks for watching.