Gilad Shainer, NVIDIA
In this theCUBE + NYSE Wired segment from “AI Factories – Data Centers of the Future,” Nebius co-founder and CBO Roman Chernin sits down with theCUBE’s John Furrier at the New York Stock Exchange to unpack how AI factories are reshaping enterprise infrastructure and the future of data centers. Chernin outlines Nebius’ two-track strategy: a multi-tenant cloud built for developer experience and managed services, and large-scale, mostly bare-metal deployments for hyperscalers and AI labs. He discusses the significance of Nebius’ Microsoft deal (described as “up to $20B” and set to become one of the largest single-site GB300 deployments) as both an engineering milestone and a way to feed scale and cash flow back into the core cloud business. The conversation explores why enterprises want “the baby of supercomputer in the cloud,” marrying cloud flexibility with supercomputing efficiency to minimize time-to-value without sacrificing performance.
Chernin details Nebius’ specialization in AI-centric workloads (large distributed training and inference at scale), a platform roadmap that moves beyond infrastructure into inference, fine-tuning and reinforcement learning as services, and a commitment to helping customers build on open-source models for control, cost and data leverage. He traces customer waves from foundational model builders to vertical AI companies and tech-forward enterprises, noting early traction with firms like Shopify and momentum in regulated sectors such as healthcare following Nebius’ compliance milestones. With roots in Yandex’s large-scale engineering culture and meaningful exposure to ClickHouse, Chernin also weighs in on the economics of AI-scale infrastructure (power and capacity as gating factors), hybrid orchestration and sovereignty, and why latency priorities vary by use case – from reasoning models to voice agents – as AI factories become the new unit of value in modern enterprise compute.
>> When we talk about AI at scale, it's no longer just about servers or even clusters, it's about building AI factories. And in that world, the network in many ways is just as important as the GPU. Now, the tension between scale-up and scale-out has been a constant in compute architecture over the years, but AI is forcing a fundamental rethink. NVIDIA has staked out an opinionated position with NVLink Fusion and Spectrum-X, arguing that purpose-built fabrics, versus retrofitted Ethernet, are the way to sustain performance at gigascale. Now, at the same time, competitors are pushing new standards purported to be more open. And here we're talking about initiatives like UALink and UEC, raising questions about lock-in, interoperability and who really controls the future of the data center. We are thrilled to welcome our guest today, Gilad Shainer, who's the Senior Vice President of Networking at NVIDIA. Gilad, welcome to theCUBE. Really appreciate your time.
Gilad Shainer
>> Thank you very much. Happy to be here.
Dave Vellante
>> That's awesome. This is such an important topic and it's ever-changing. Let's talk about this whole scale-up versus scale-out. There's always been this tension, as I referenced upfront. How do you see the balance shifting in AI infrastructure? What role do things like NVLink Fusion and Spectrum-X play in changing that dynamic?
Gilad Shainer
>> Yeah, good question. Personally, I don't think there is a tension between scale-up and scale-out. Those are two infrastructures that each have an important role in building an AI data center or AI factory. And actually, they serve two different missions in a sense. The mission of the scale-up fabric is to connect GPU ASICs together to form a larger virtual GPU, for example a rack-scale GPU. So scale-up infrastructure needs to support a massive amount of bandwidth between GPU ASICs, support the right operations like load and store, very low latency, high message rate, and essentially connect those GPU ASICs to form a single GPU. Now, once you have that GPU formed, you want to scale out. So you want to take those rack-scale GPUs, for example, and connect hundreds of thousands of them together to build your AI factory, and this is where the scale-out infrastructure comes into play. The scale-out infrastructure's mission is essentially to enable jitter-free, zero-jitter connectivity between those hundreds of thousands of GPUs. As we all know, AI is distributed computing, and when you're running distributed computing workloads across thousands, tens of thousands or hundreds of thousands of components, you cannot have delay on a single component, because if I have 100,000 GPUs working together and only one GPU is delayed, all the rest are going to wait, and we don't want those wait cycles. We want the AI data centers to operate and provide the best outcome, and this is the mission of scale-out. So essentially those two infrastructures need to work together. Now you can ask what the size of scale-out should be, and it's actually determined by the workloads that you want to run. Okay? So we don't see it as one versus the other. Both of them work together.
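To make the wait-cycle point concrete, here is a short editorial sketch (not from the interview; the step time, jitter range and trial count are all assumptions) of how the slowest GPU sets the pace of a synchronous step:

```python
import random

# Toy model of the straggler effect Shainer describes: in a synchronous
# distributed step, the step completes only when the slowest GPU finishes,
# so the step time is the maximum over all per-GPU times.
def mean_step_time(num_gpus, base_ms=100.0, jitter_ms=5.0, trials=50):
    total = 0.0
    for _ in range(trials):
        # max() models every GPU waiting on the slowest one.
        total += max(base_ms + random.uniform(0.0, jitter_ms)
                     for _ in range(num_gpus))
    return total / trials

for n in (8, 1_000, 100_000):
    t = mean_step_time(n)
    print(f"{n:>7} GPUs: {t:6.1f} ms/step, "
          f"{(t / 100.0 - 1.0) * 100:4.1f}% lost waiting")
```

As the cluster grows, the expected maximum converges to the worst-case jitter, which is why a jitter-free scale-out fabric matters far more at 100,000 GPUs than at eight.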
Dave Vellante
>> So I want to follow up on that and clarify. I used the word tension, maybe it's not the right word, but when I think of scale-up, I go back in history, I think of the IBM mainframe. When I think of scale-out, I think of the Google File System. Maybe trade-off was a better word. There's always been a trade-off in architectures. And what you are saying, if I understand it, is that with NVLink and NVLink Fusion you're solving the scale-up problem, and with Spectrum-X you're solving scale-out. You are having your cake and eating it too. Are you also not gaining weight? So I wonder if you could talk about how you're able to balance what historically, Gilad, has been seen as a trade-off. You pick one or the other and you double down on it. How have you balanced that?
Gilad Shainer
>> Yeah, so we're not picking one versus the other, obviously. How many GPU ASICs you want to put on a scale-up fabric, essentially forming the single GPU entity, really depends on the workloads that you want to serve. So what we do at NVIDIA is not designing different components and then trying to figure out what to do with each component. We design the full data center. The data center is today the unit of computing, and when you design a full data center, you design it to serve specific workloads, and it could be training workloads, it could be inferencing workloads, it can be a combination of those workloads. And then based on the workloads that you serve, you know the size of the scale-up infrastructure you need to build. You know the size of the scale-out infrastructure that you want to build, and this is how you combine them both together. So there is no one versus the other in that sense. You try to find the optimal point of performance per power, performance per cost, performance per user, performance per token, for example, tokens per user. And based on that, you determine what the scale-up size is going to be and what the scale-out size is going to be. And as you mentioned, for scale-up we built NVLink, and we see NVLink as essentially an extension of the GPU. Okay? Some people may look at scale-up as a network; it's not an actual network, it's an extension of the compute side, because it's forming one large compute entity and it's very complicated to build. And for that reason, we decided to bring NVLink, or enable NVLink, for anyone that requires scale-up infrastructure, regardless of what XPU, what GPU, what CPU they're using; they can actually leverage what we build. And then obviously on the scale-out side there is Spectrum-X Ethernet that we built. It's an Ethernet that was built specifically for AI, and obviously we also provide that. So we build a data center. The size of scale-up and scale-out depends on the workload that you want to serve, and once we build a data center, we offer that as a whole, or in pieces, to anyone that wants to leverage our technology.
Dave Vellante
>> Okay. So I'm going to keep you for longer than I promised because there are so many questions here. I wonder if we could dig into, well, let me actually start. My understanding of your design, NVIDIA's approach to building GPUs, is you build large, some people call them monolithic, which sounds like a pejorative, but I've always liked it. It's a large SRAM that's shared, and NVLink allows you to have synchronous connections across those GPUs. I oftentimes think about the whole chiplet architecture, which I've always thought was asynchronous and is another trade-off, but that's one piece. That large SRAM I've always seen as an advantage, but there are challenges there because as you scale, SRAM real estate takes up more and more. And I understand that's a challenge you guys have to design around. But when I look at NVLink, it's gone from 160 gigabytes per second, I think, to 1.8 terabytes per second with Blackwell and 10 terabytes per second plus with Rubin that's coming. And I compare that with PCIe, the general-purpose standard. I think you've argued that having a purpose-built network is going to give you advantages. So I wonder if you could address those advantages from your perspective.
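As a quick editorial aside on the figures Dave cites (hedged: NVLink bandwidth conventions vary by generation between per-direction and aggregate, and the PCIe baseline is an assumption), here is the back-of-envelope transfer time for a hypothetical 16 GB payload:

```python
# Rough transfer-time comparison. Bandwidths are the figures cited in the
# conversation plus an assumed PCIe Gen5 x16 baseline (~64 GB/s per
# direction); real achievable throughput will be below line rate.
payload_gb = 16  # hypothetical gradient/activation exchange
links_gb_per_s = {
    "NVLink, first generation (~160 GB/s)": 160,
    "NVLink on Blackwell (~1.8 TB/s)": 1_800,
    "PCIe Gen5 x16 (~64 GB/s, assumed baseline)": 64,
}
for name, bw in links_gb_per_s.items():
    print(f"{name}: {payload_gb / bw * 1000:7.2f} ms")
```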
Gilad Shainer
>> Yeah, I think that basically everything is purpose-built. Okay? If you think about it.
Dave Vellante
>> What do you mean by that? What do you mean by that?
Gilad Shainer
>> Yeah, we can look at Ethernet, for example. That would be a great example actually.
Dave Vellante
>> Okay.
Gilad Shainer
>> In the Ethernet world, there are different kinds of Ethernet, different kinds of products under the category of Ethernet, built to serve different kinds of workloads. There is no one Ethernet switch architecture out there. There are many. And the reason there are many is that each one of them was purposely built to serve a different kind of workload. You can find Ethernet that was built for heavily virtualized enterprise data centers: a good amount of functions, a lot of capabilities around virtualization, but a small radix. There is another kind of Ethernet that was built to support hyperscale clouds and single-server workloads: a good amount of radix, but it doesn't really care about jitter, and jitter essentially is fine there. There is another purpose-built Ethernet that was designed to support service providers, for example, utilizing deep buffers to act as a shock absorber over distances and so forth. And we created Spectrum-X Ethernet, which is essentially Ethernet purposely built to support the distributed computing of AI. So even when you look at a standard like Ethernet, there are purpose-built products for different kinds of workloads. That's why I don't see anything as a general fabric; there is no truly general fabric. There are a lot of purpose-built products, which are based on standards and open source, for example, but they're all purpose-built. And what we did is essentially take Ethernet and create a purpose-built Ethernet for AI computing, because that did not exist in the market.
Dave Vellante
>> Okay. That makes sense. So Spectrum-X, of course, for people following along, that's NVIDIA's Ethernet-based, AI-optimized fabric, I'll call it, that connects racks of GPUs across the data center. So now we're talking distance. You pair it with SuperNICs, and you talked about jitter a couple of times. That jitter has plagued traditional Ethernet for a while. I'm sure the Ultra Ethernet folks are very much working on that and would say they've solved it, but so your premise here is that everything is purpose-built. You leaned into Ethernet, it's a standard, yet it's optimized for AI. So that brings me to the openness argument. A lot of times people look at NVIDIA and say, "Well, it's a closed system. Openness is a big theme in networking, with Ethernet, UALink, UEC, all the others jockeying for position." So what's your position on how the market should think about NVIDIA's approach to NVLink and Spectrum-X, specifically, Gilad, in the context of openness and ecosystem adoption and the whole avoiding lock-in? What do you say there?
Gilad Shainer
>> First, we love the ecosystem, and second, we love openness. Our infrastructure is based, for example, on open standards. InfiniBand is an open standard, for example. The spec is developed by the IBTA consortium and it's available to anyone that wants to build InfiniBand products, and there are multiple companies that do build them in the market. Spectrum-X is built on Ethernet. It's full Ethernet, based on the Ethernet standard. It's interoperable with any Ethernet device that exists, and it's fully open from that perspective. So that's on the hardware side. When we go to software, we love open source. On the Spectrum-X Ethernet infrastructure, we support the SONiC operating system, and SONiC is an open-source network operating system. And the reason we love standards and we love open source is because this is the way our customers can innovate on top of the technology that we created. So obviously the advantage is in how you design it, how you build it inside, how you configure this stuff. But everything is based on open source, everything is based on standards. Even when we look at NVLink Fusion: the interface of NVLink Fusion to connect to GPUs, or connect to XPUs, or connect to CPUs, is based on an open interface. So you can choose to connect to it or you can choose to connect to anything else.
Dave Vellante
>> Okay. I know you don't want to bash the competition, that's not NVIDIA's style, but I'm inferring that legacy Ethernet, or traditional Ethernet if I can call it that, wasn't specifically designed for AI workloads. It happened to be very good for the internet. And of course, Ultra Ethernet, the consortium, would say it fits well with AI, but can you help us understand in practical terms your perspective as to why AI fabrics need something different, and specifically your approach, and what it means for enterprises that are trying to put in infrastructure that's going to be there for the future?
Gilad Shainer
>> Yeah. So first, regarding your comments on consortiums, by the way, NVIDIA is part of the Ultra Ethernet Consortium. We're happy to join any consortium that exists, happy to participate, happy to help the ecosystem. The ecosystem is important. And of course, if we see good technology being developed, being specified, we're happy to use it, happy to bring it to NVIDIA. So we are part of the ecosystem and we are part of those consortiums, obviously. Now, your question about AI. AI is distributed computing. It's a classical example of distributed computing. It means that it's a workload that needs to run across many compute entities. Okay? Traditional off-the-shelf Ethernet was not designed to support distributed computing. It was designed for single-server, single-CPU workloads. If you look at the design of off-the-shelf Ethernet, it aims to move data in and out of a server, but it does not deal well with the case where data needs to be distributed across multiple compute engines and completely synchronized between them. Okay? When you break a problem across multiple compute engines, you want those compute engines to start computing at the same time, finish computing at the same time, and then go to the next phase. If one is late, then everyone else is waiting for it. So again, the example of a hundred thousand GPUs: 99,999 finish at the same time, but one is late, so every GPU is waiting, and that's really expensive. So we needed to bring in an infrastructure that eliminated the jitter, which means that every GPU will get the data at the same time. And for that, we actually needed to create a new kind of implementation of Ethernet. We focused on lossless fabrics. In off-the-shelf Ethernet, if there is any contention on the network, the easy path was to drop a packet and retransmit it again, but in a distributed computing workload, you cannot afford that retransmission, so you don't want to drop packets. So we brought in elements like losslessness, and we brought in elements like full adaptive routing to distribute traffic across the fabric. We built an infrastructure, because one thing we understood very early on is that you cannot solve the networking problem for AI on a single device. You cannot solve everything on a switch. You need the switch to work together with a SuperNIC. Some functions will be done on the SuperNIC, some functions will be done on the switch, but it behaves as a single end-to-end infrastructure. Okay, those are the elements that made Spectrum-X Ethernet a great solution for scale-out AI. And today, Spectrum-X Ethernet connects the largest number of GPUs running a single job, something no other Ethernet version has been able to achieve. Now, for enterprises, obviously what we did with Spectrum-X gives them a great solution, because on one side it's running Ethernet, so all of the software ecosystem that they have built and invested in over the years is maintained, but they get the right infrastructure, the right Ethernet infrastructure, to support AI workloads so they can maximize their investment in building those AI data centers.
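A small editorial sketch of why losslessness matters here (every parameter below is an assumption for illustration, not a measured figure): when a step's traffic is synchronized, even a rare drop stalls the whole step for a retransmission timeout.

```python
# Expected step time when any single dropped packet stalls the step for a
# retransmission timeout. step_ms, rto_ms and packets_per_step are assumed.
step_ms = 5.0                  # assumed compute+communication time per step
rto_ms = 50.0                  # assumed retransmission timeout on a drop
packets_per_step = 1_000_000   # assumed packets exchanged per step

def expected_step_ms(p_drop):
    # Probability that at least one of the step's packets is dropped.
    p_any_drop = 1.0 - (1.0 - p_drop) ** packets_per_step
    return step_ms + p_any_drop * rto_ms

for p in (0.0, 1e-9, 1e-7, 1e-6):
    print(f"drop rate {p:g}: expected step {expected_step_ms(p):6.2f} ms")
```

Even at a one-in-ten-million drop rate, the expected step time nearly doubles in this toy model, which is the intuition behind lossless operation plus adaptive routing rather than drop-and-retransmit.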
Dave Vellante
>> Okay. And you're talking about a classical distributed computing problem: as you get to scale, you get diminishing returns, and if I understood you correctly, you've solved that problem. I wonder if you could, I'll go back to my earlier question. You didn't really answer my lock-in question, so I want to ask it again. And I tell you, I've been researching this. I've been an analyst for many, many decades, and I would say about 15% of the customers that I talk to truly care about lock-in and would make a business decision to avoid lock-in versus getting business value. 85% would say, "I'll take business value over fear of lock-in any day." Are you suggesting that, A, you adopt open standards where it makes sense, and B, I'm inferring, I don't want to put words in your mouth, but if you have a better solution, people are going to gravitate to it because it has greater business value? How do you address the lock-in concern?
Gilad Shainer
>> Yeah, so first, it's not that we adopted open source. Actually, we have been contributing to open source throughout the years. NVIDIA is one of the leading contributors, for example, to open source SONiC, which is the operating system that runs on Ethernet switches. Now, for your question, people can choose any element that we build and they can connect it to any other element that they choose. We have customers that choose to use our GPU, but they use it with different kinds of connectivity, and sometimes it's their own connectivity. There are customers that choose to use our infrastructure, but they use it with other XPUs or other compute elements. There is nothing that prohibits our customers from picking elements and connecting them to any other element. Every device that we have built has a standard interface to it. Our SuperNIC can connect to any PCI Express device, for example, which means any CPU, any XPU, any GPU that exists in the market. Our switch has Ethernet connectivity; you can connect it to any other Ethernet switch, you can connect it to any other Ethernet NIC. NVLink Fusion is connected through a standard interface, so you can connect other XPUs, and XPUs can connect to another scale-up infrastructure if you want to. So there is full flexibility. You can take any piece of that building block, and every piece in that building block has standard interfaces that work with current tools and open source, and there is a lot of open source software that runs on top of everything that we built.
Dave Vellante
>> Okay. So the entries and the exits into and from NVIDIA, the ecosystem, are published, they're known, and in that sense they're open. This is NVLink Fusion, is that correct? I remember the announcement, that was very exciting. MediaTek, Marvell, Qualcomm, Astera Labs, I love Astera Labs, Synopsys, Cadence, on and on and on, and they can bring in their XPUs, CPUs, GPUs, NPUs, let's call them any accelerator, into that NVIDIA GPU fabric. Is that the right way to think about it?
Gilad Shainer
>> Yeah, that's correct. So NVLink Fusion is built in a way that you can connect it through a standard interface to any accelerator that you wish or any CPU that you wish, and you can get connected to an XPU, to a CPU, or to both of them. You can choose either way. Now, if you want to connect to another scale-up fabric, it's the same interface, that same standard interface that you have on your accelerator. The reason that we brought out NVLink Fusion is that we understand how complicated it is to build scale-up infrastructure, and we wanted to share what we have built over the years with other people who can enjoy it and leverage the great technology we've built. And this is why we offered NVLink Fusion, and we made sure that NVLink Fusion has a standard interface to connect to. So you can choose NVLink Fusion, or in the future you can choose any other scale-up infrastructure that you wish. The same openness covers our SuperNICs, covers our Ethernet switches, covers everything that we build in the infrastructure.
Dave Vellante
>> Got it. Okay. I want to talk about business value, cost efficiency, ROI. There's a lot of talk about ROI being elusive in AI. We all know about Moore's Law. Not as many people know about Wright's Law. A lot of our research is based on Wright's Law, which we believe is very important in semiconductor manufacturing, but Jensen's Law is now sort of in vogue, which is: you buy more, you make more. So I want to ask you, the AI wave right now, it's like running on one cylinder, and that cylinder is CapEx from hyperscalers and neoclouds and service providers, but where do you see the big gains? If you saw it, maybe you didn't see it, MIT just came out with a study, everybody is talking about it, about how all the AI initiatives are going to fail, which I don't think is true. I think it's all learning. But where do you see the big efficiency and ROI gains coming in AI generally, but specifically from purpose-built scale-out networking like Spectrum-X, and scale-up, which is a lot of the heritage, versus some of the other off-the-shelf stuff we talked about? Where do you see that big payback?
Gilad Shainer
>> Yeah, so when we focus on the infrastructure, obviously you can choose different kinds of infrastructure. You can take off-the-shelf, or you can take something that was purposely built for AI, but the impact of the infrastructure on the performance of your data center is great. The infrastructure actually determines whether you are going to have a server farm, let's say, or an AI supercomputer. It's like vehicles: there are different kinds of vehicles, and you can decide to take a minivan to a Formula 1 racetrack. Is it going to drive there? Yes, it will. Do you have a chance to win the race? Probably not. Now, because there is a good amount of investment when you build an AI factory, the network, the infrastructure, if you have the right one, is actually free. When you think about it, the impact of the infrastructure on the performance of the data center is much greater than the cost of that infrastructure. And that would be the difference between taking off-the-shelf Ethernet that was not purposely built for AI versus taking Spectrum-X Ethernet that actually was purposely built for AI. It gives you all of the interfaces, all the stuff that you can leverage, all the investment that you've made in the software ecosystem, your management software, your access software, everything. But you've made your AI investment much more efficient. That's the difference. That's why we did Spectrum-X, because, by the way, using off-the-shelf Ethernet for AI factories would have made AI factories much, much, much more expensive.
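The "right network is actually free" argument is easy to check with hedged arithmetic. An editorial illustration follows; every input is an assumption, not an NVIDIA or analyst figure:

```python
# If the fabric is a modest share of cluster cost but lifts effective GPU
# utilization, the added effective compute can outweigh the fabric's cost.
# All inputs below are hypothetical.
cluster_cost_m = 1_000      # total cluster cost in $M (assumed)
fabric_share = 0.12         # fabric share of total cost (assumed)
util_generic = 0.55         # effective utilization, generic fabric (assumed)
util_purpose_built = 0.80   # effective utilization, AI fabric (assumed)

compute_cost_m = cluster_cost_m * (1 - fabric_share)
gain_m = compute_cost_m * (util_purpose_built - util_generic)
fabric_cost_m = cluster_cost_m * fabric_share
print(f"fabric cost: ${fabric_cost_m:.0f}M, "
      f"extra effective compute unlocked: ${gain_m:.0f}M")
```

Under these assumed numbers, the utilization uplift is worth roughly twice what the fabric costs, which is the sense in which the right network pays for itself.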
Dave Vellante
>> I know we're talking about NVLink and Ethernet. Can I ask you about Quantum-X?
Gilad Shainer
>> Sure.
Dave Vellante
>> So if I understand it, you position InfiniBand as the flagship for supercomputing clusters. I've always loved InfiniBand. I wrote a piece many years ago, "Larry Ellison Loves InfiniBand." You remember, he was connecting at high speed with Exadata, and of course Mellanox, what a fantastic acquisition you guys made. But talk a little bit about Quantum-X and InfiniBand, the future of InfiniBand and where you see that going.
Gilad Shainer
>> Yes. Yeah, InfiniBand is great and Spectrum-X Ethernet is great, and we develop both scale-out infrastructures and we offer both. You can choose what makes sense for you to use. Now, InfiniBand, or Quantum-X in that sense, was designed from the ground up for distributed computing. That was the purpose of it, and it's a great technology for distributed computing. As an example, you can check the recent TOP500 supercomputer list, and InfiniBand connects more than 270 supercomputers on the list, which is actually the highest number InfiniBand has ever connected on that list. It's a great example of the benefit of InfiniBand, and InfiniBand is the gold standard for scale-out infrastructure for distributed computing workloads. It can be HPC, it can be scientific computing, it can be AI, and we have many, many customers that use Quantum-X InfiniBand, enjoying its performance and leveraging all its benefits for running their AI workloads. With Spectrum-X, the way that we built it is that we took as much as we could from the implementation of InfiniBand into Ethernet. We took losslessness into Ethernet, we took the RDMA operations into Ethernet, we took the way that we do load balancing and congestion control into Ethernet. And that implementation enabled us to build Spectrum-X Ethernet as the best Ethernet for AI, purpose-built for AI. Now, as we move forward, we continue to develop elements covering both Quantum-X InfiniBand and Spectrum-X Ethernet. The next generation that is coming is going to be with co-packaged silicon photonics. That's a great next phase of infrastructure: incorporating optical engines into switch packages to reduce the number of components in the data center, increase its resiliency, reduce power consumption and improve scale, enabling more efficient data centers. That's the next phase, and we're bringing it in as Quantum-X Photonics on the InfiniBand side and Spectrum-X Ethernet Photonics on the Ethernet side.
Dave Vellante
>> So I told you I was going to keep you longer than I said, and I promised a couple of things. One is you had an announcement today, NVIDIA did, with Fujitsu to co-design FugakuNEXT, the successor to the well-known Fugaku supercomputer. So congratulations on that announcement. I think it came out today, or maybe it was yesterday. Today's the 22nd. Yeah, it was yesterday. And then you mentioned photonics. So as you grow into this gigawatt era of AI, power is super important, and optics are like a hidden tax. So I wonder, how do co-packaged optics and Spectrum-X photonics affect the economics of scaling out, when we hear about millions of GPUs being connected? I wonder if you could address that.
Gilad Shainer
>> Yeah, yeah, of course. Happy to. So first, obviously before you go to optics, you want to maximize copper.
Dave Vellante
>> Yeah.
Gilad Shainer
>> You want to maximize the usage of copper because copper is essentially zero power. It's very reliable, it's very cost-effective, right? So you want to maximize copper, and this is why we're focusing on copper for scale-up. We brought in elements like liquid cooling technologies in order to increase density in the rack, so we can connect more GPU ASICs over copper, because that's the best connectivity. But when you go to scale-out, you cannot use copper because of the reach, and this is where optics is the way to connect that infrastructure. Now, as the AI data center increases in size and we increase the bandwidth per GPU on the scale-out infrastructure, the power that the scale-out optical network consumes starts to be a big number. And today, it can reach almost 10% of the compute capacity. Now, we all know that power is one of the limiting factors when building AI factories, and the more power you can save, the more compute you can bring in and the better you can leverage the power capacity that you have. So this is where we bring in liquid cooling technologies and so forth. For scale-out, the way to reduce and optimize power is essentially to take the optical connections to the next level. Today, optical connections run on pluggable transceivers, which means that the optical engine sits on a transceiver somewhat far from the switch itself, which means that you need to invest energy to drive the signal from the optical engine that sits outside the box all the way to the switch package that sits inside the box. By moving that optical engine from outside the box to inside the box, you can save a lot of energy. You can reduce the power consumption of that scale-out infrastructure by 3.5X. And that means that at the same power, ISO power for the scale-out infrastructure, you can connect 3X more GPUs. That's how important it is.
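Applying the figures from this answer to an assumed facility size gives a feel for the stakes. This is editorial arithmetic: the 100 MW facility and the direct reallocation of the savings to compute are assumptions, while the 10% and 3.5X figures are the ones quoted above.

```python
# Co-packaged optics power math using the figures quoted in the answer:
# scale-out optics at ~10% of compute capacity, and a 3.5x power reduction
# from moving the optical engine into the switch package.
facility_mw = 100.0    # assumed facility power budget
optics_share = 0.10    # "almost ... 10% of the compute capacity"
cpo_reduction = 3.5    # "reduce the power consumption ... by 3.5X"

pluggable_mw = facility_mw * optics_share
cpo_mw = pluggable_mw / cpo_reduction
print(f"pluggable optics: {pluggable_mw:.1f} MW, "
      f"co-packaged optics: {cpo_mw:.1f} MW, "
      f"freed for compute: {pluggable_mw - cpo_mw:.1f} MW")
```

On these assumptions, roughly 7 MW of a 100 MW facility moves from driving signals to driving GPUs, which is why optics stops being a rounding error at gigawatt scale.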
Dave Vellante
>> Wow.
Gilad Shainer
>> And this is where CPO is very key and important. It's a key element, and we are bringing our CPO in at the 200-gig SerDes generation for both Quantum-X Photonics and Spectrum-X Ethernet Photonics.
Dave Vellante
>> Amazing. I'm so lucky that John Furrier, my business partner, was golfing today and I got a chance to interview you. This is fascinating. You keep bringing up topics and I keep thinking of more questions. You brought up liquid cooling; I wonder if you could comment on the state of direct liquid cooling and things like secondary fluid networks and the connection integrity of the boring but important hoses. It's come a long way, but I remember attending some supercomputing conferences a few years ago and looking at the mesh of hoses thinking, "Wow, this is not very elegant." And now I see them today and they're much more beautiful. Is the liquid cooling ecosystem where you want it to be? Are you happy with the progress that's been made? Can you comment on that?
Gilad Shainer
>> Yeah, we work in an ecosystem and we have great ecosystem partners across everything we build. We have ecosystem partners as part of the co-packaged optics and photonics work, and ecosystem partners around building the liquid-cooled elements that we need and so forth. Obviously, things continue to progress, new elements are being created, and we're exposed to new technology, which is just amazing. Just amazing. We're focusing on liquid cooling for the compute generation that we're driving out now, and obviously it's going to be an integrated part of everything else we do. We're enabling 100% liquid cooling on the infrastructure we build; the scale-out switches are based on liquid cooling, driving 100% liquid cooling there. And it's important for us to make it easier for people to consume that, and easier for people to build that. That's why we share the same racks, the same liquid-cooled racks, for the compute devices as well as the networking and infrastructure devices. And we try to make it very simple to consume, to enjoy and so forth. So obviously, there are more things to do, right? It's never-ending; there is no endgame here, and this is why it's so fascinating to work in this area. But yes, I think we're very happy with where we are right now, but we also see the other things that will need to be developed for the coming generation, and this is where our focus is.
Dave Vellante
>> I have one follow-up and one last question, and then I'll let you go. You said 100% liquid cooling. Today, I know a lot of the training is being done on servers that have a hybrid of air and liquid. Do you see that as a bridge, that we're ultimately going to be 100% liquid cooled, or do you see that hybrid persisting? I presume it's use-case dependent, but for the really demanding AI workloads, will those eventually be, and maybe they already are, 100% liquid cooled?
Gilad Shainer
>> Yeah, so obviously if you look back, the majority of data centers were air-cooled, and then they started to migrate to hybrid and then to fully liquid-cooled data centers. And it depends on the target: what data center you build, the size of the data center that you have, what facilities you need to use and so forth. Obviously, liquid cooling is the right path. It's a good path to increase compute density. It's the right path to reduce power consumption and actually be able to increase the capability of the data center. It's the right path to silicon photonics, the integration of photonics into the switch infrastructure and so forth. And therefore, we're probably going to see more and more liquid cooling used in the new data centers that are being built and for future technologies.
Dave Vellante
>> That's amazing. All right, last question. Just zoom out, your telescope, long-term vision. If you think about five years, I know it's a long time in this era, but what should we expect AI infrastructure, compute infrastructure, to look like? Is it going to converge around a handful of purpose-built fabrics, or do you think we'll see a hybrid coexistence of some off-the-shelf technologies that maybe weren't purpose-built, maybe some emerging standards? What do you see as the endgame here, or the steady state? There really is no endgame, as you just pointed out.
Gilad Shainer
>> Yeah. When you talk about an outlook of five years, it's a very interesting question, and the reason it's a very interesting question is that I would like to know that too. If you looked five years back, would you have known where we're going to be today? Right? You completely couldn't have imagined it. It's like the world has changed seven times in that sense. The amount of technology that was created in the last five years is amazing. So if you ask me what the next five years are going to be, it's a really interesting question. So first, I can tell you what we're focusing on now, and right now it's co-packaged optics on one side, as we look at the next generation. We also just announced Spectrum-XGS, Spectrum-X for gigascale Ethernet, and that becomes important because now, when you maximize your compute in a single data center, it's not enough for the workloads that you want to run. We actually see the need to connect multiple data centers together to form gigascale AI factories to support the next generation of workloads. And to do that, you need to start running over long distances. So we talked about scale-up, we talked about scale-out. Now we have scale-across. We have a new infrastructure being created to support data-center-to-data-center connectivity, to enable AI across data centers. And this is what we announced now, and we announced that CoreWeave is the first customer implementing Spectrum-XGS Ethernet to connect multiple sites together to form gigascale AI factories. So those two things are going to be core elements of the next generation of infrastructure. Moving out five years, AI is growing at an amazing pace. Every data center is going to be accelerated. You want to maximize your capability in every data center, and therefore you will see purpose-built, but industry-standard-based and open-source, infrastructure to support those data centers. And the cadence, the cadence of the technology: today it's an annual cadence. Every year you see a new infrastructure, you see new switches, you see new NICs, you see new accelerators being built. So in five years, we're talking about five generations. Okay? So this is why I said it's a very interesting question.
Dave Vellante
>> We call it moving at the speed of NVIDIA and the industry is doing that. When are you going to solve the speed of light problem, Gilad? Are you working on that?
Gilad Shainer
>> That's an easy task. That one was already solved.
Dave Vellante
>> Right. Hey, this has been an amazing conversation, Gilad. Thank you so much for taking the time. It was really a pleasure having you.
Gilad Shainer
>> Yeah, thank you for having me here, and see you next time.
Dave Vellante
>> Yeah, would love to have you back. And thank you for watching this CUBE conversation. This is Dave Vellante for theCUBE, and we will see you next time.
>> When we talk about AI at scale, it's no longer just about servers or even clusters, it's about building AI factories. And in that world, the network in many ways is just as important as the GPU. Now, the tension between scale-up and scale-out, it's been a constant in compute architecture over the years, but AI is forcing a fundamental rethink. NVIDIA has staked out an opinionated position with NVLink Fusion and Spectrum-X arguing that purpose-built fabrics versus retrofitted ethernet are the way to sustain performance at giga scale. Now, at the same time, competitors are pushing new standards purported to be more open. And here, we're talking initiatives like UALink, UEC and raising questions about lock-in interoperability and who really controls the future of the data center. We are thrilled to welcome our guest today, Gilad Shainer, who's the Senior Vice President of Networking at NVIDIA. Gilad, welcome to theCUBE. Really appreciate your time.
Gilad Shainer
>> Thank you very much. Happy to be here.
Dave Vellante
>> That's awesome. This is such an important topic and it's ever-changing. Let's talk about this whole scale-up versus scale-out. There's always been this tension, as I referenced upfront. How do you see the balance shifting in AI infrastructure? What role does the things like NVLink Fusion and Spectrum-X play in changing that dynamic?
Gilad Shainer
>> Yeah, good question. Personally, I don't think that there is a tension between scale-up and scale-out. Those are two infrastructures that have an important role as part of building an AI data center or AI factories. And actually, they are serving two different missions in a sense. Scale-up fabric, the mission of scale-up fabric is to connect those GPU ASICs together to form a larger virtual GPU to form, for example, a rec scale GPU. So scale-up infrastructure requires to support massive amount of bandwidth between GPU ASICs to support the right operations like load and store, very low latency, high message rate, and essentially be able to connect those GPU ASICs and to form a single GPU. Now, once you have the GPU being formed, you want to scale-out. So you want to take those rec scale GPUs for example, and connect hundreds of thousands of them together to build your air factory, and this is where the scale-out infrastructure comes in play, and scale-out infrastructure mission is essentially to enable a jitter-free, zero jitter connectivity between those hundreds of thousands of GPUs. As we all know, AI is distributed computing and when you're running distributed computing workloads across thousands, tens of thousands of hundreds of thousands of components, you cannot have delay for a single component because if I have 100,000 GPUs working together and only one GPU is going to be delayed, all the rest is going to wait, and we don't want that weight cycles. We want the AI data centers to operate and provide the best outcome, and this is the mission of scale-out. So essentially those two infrastructures need to work together. Now you can ask what would be the size of scale-out, what's going to be the size of scale-out and it's actually determined by the workloads that you want to run. Okay? So we don't see that as one versus the other. It's like both of them are working together.
Dave Vellante
>> So I want to follow up on that and clarify. So I use the word tension, maybe it's not the right word, but when I think of scale-up, I go back in history, I think of IBM Mainframe. When I think of scale-out, I think of Google File System, maybe trade-off was a better word. There's always been a trade-off in architectures. And what you are saying, if I understand it is with NVLink and NVLink Fusion, you're solving the scale-up problem with Spectrum-X, you're solving scale-out, you are having your cake and eating it too. Are you also not gaining weight? So I wonder if you could talk about how you're able to balance that what historically has been seen Gilad as a trade-off. You pick one or the other and you double down on it. How have you balanced that?
Gilad Shainer
>> Yeah, so we're not picking one versus the other obviously. How many GPU ASICs you want to put on a scale-up essentially forming the single GPU entity really depends on the workloads that you want to serve. So what we do in NVIDIA is not designing different components and then trying to figure out what we do in each component. We design the full data center. The data center is today the unit of computing, and when you design a full data center, you design it to serve specific workloads, and it could be training workloads, it could be inferencing workloads, it can be a combination of that workloads. And then based on the workloads that you serve, you know what is the size of the scale-up infrastructure you to build. You know what's the size of the scale out infrastructure that you want to build, and this is how you combine them both together. So there is no one versus the other in that sense. You try to find the optimal point of performance per power, performance per cost, performance per user, performance per token, for example, token per user. And based on that, you determine what's going to be the scale-up size, what's going to be the scale-out size. And as you mentioned for scale-up, we build NVLink and NVLink, we see NVLink as essentially an extension for the GPU. Okay? Some people may look on scale up as a network, it's not actual network, it's an extension of the compute side because it's forming one large compute entity and it's very complicated to build. And from that reason, we decided to bring NVLink or enable NVLink to anyone that requires scale-up infrastructure. Regardless in what XPU is using, what GPU is using, what CPU is using, it can actually leverage what we build, and then obviously on the scale out there is Spectrum-X Ethernet that we built. It's an ethernet that was built specifically for AI, and obviously we also provide that. So we build a data center. The size of scale up and scale out depends on the workload that you want to serve, and once we build a data center, we offering that as a whole or pieces to anyone that wants to leverage our technology.
Dave Vellante
>> Okay. So I'm going to keep you for longer than I promised because there's so many questions here. I wonder if we could dig into, well, let me actually start. My understanding of your design, just NVIDIA's approach to building GPUs is you build large, some people call them monolithic, which sounds like a pejorative, but I've always liked it. It's a large SRAM that's shared, and NVLink allows you to have synchronous connections across those GPUs. I oftentimes think about the whole chiplet architecture, which I've always thought was asynchronous and as another trade off, but that's one piece with that large SRAM I've always seen as advantage, but there are challenges there because as you scale, SRAM, real estate takes up more and more. And I understand that's a challenge you guys have to design around. But when I look at NVLink, it's gone from 160 gigabytes, I think to 1.8 terabytes with Blackwell and 10 terabytes plus with Rubin that's coming. And I compare that with PCIe general purpose standard. I think you've argued that having a purpose-built network is going to give you advantages. So I wonder if you could address those advantages from your perspective.
Gilad Shainer
>> Yeah, I think that basically everything is purpose built. Okay? If you think about it.
Dave Vellante
>> What do you mean by that? What do you mean by that?
Gilad Shainer
>> Yeah, and we can look for example on ethernet as an example. That would be a great example actually.
Dave Vellante
>> Okay.
Gilad Shainer
>> In the ethernet world, there are different kinds of ethernet, different kinds of product under the category of ethernet that was built to serve different kinds of workloads. There is no one ethernet switch architecture out there. There are many. And the reason that there are many is because each one of them was purposely built to serve different kinds of workloads. You can find ethernet that was built for heavy virtualized enterprise data centers. Good amount of functions, a lot of functions, lot of capabilities regarding virtualization, but small Redis. There is another kind of ethernet that was built to support Hyperskill cloud, single server workloads, good amount of Redis, but doesn't really care about jitter and jitter essentially is fine. There is another purpose-built ethernet that was designed to support service providers, for example, and utilizing debuffers in order to run a shock observer for distances and so forth. And we created Spectrum-X Ethernet, which is essentially it's ethernet purposely built for supporting disability computing of AI. So even when you look on a standard like ethernet, they are purposely built product for different kind of workloads. So that's why I don't see as something as a general fabric, there is no really general fabric. There is a lot of purpose built, which are based on standards and open source for example, but they're all purpose-built. And what we did is essentially build, took ethernet and created a purpose-built ethernet for AI for computing because that did not exist in the market.
Dave Vellante
>> Okay. That makes sense. So Spectrum-X, of course, people following along, that's NVIDIA's ethernet based optimized AI optimized fabric, I'll call it, that's this connects racks of GPUs across the data center. So now we're talking distance. You plug in, you pair it with SuperNICs, and you talked about jitter a couple of times. As those jitters plagued in traditional ethernet for a while, I'm sure the ultra ethernet folks are very much working on that and would say they solve that, but so your premise here is that everything is purpose-built. You leaned into ethernet, it's a standard, but yet it's optimized for AI. So that brings me to the openness argument. A lot of times people look at NVIDIA and say, "Well, it's a closed system, openness is a big theme in networking with ethernet, UALink, UEC, all the others jockeying for position." So what's your position on how the market should think about NVIDIA's approach to NVLink and Spectrum-X, specifically Gilad in the context of openness and ecosystem adoption and the whole avoiding lock-in? What do you say there?
Gilad Shainer
>> First, we love ecosystem, and second, we love openness. And we are, our infrastructure is based, for example, on open standardizations. InfiniBand, it's an open standard for example. The spec is being developed by the IBTA consortium and it's available to anyone that wants to build InfiniBand products, for example. And there is multiple companies that do build that in the market. Spectrumx is built on ethernet. It's a full ethernet. It's based on the Ethernet standardization. It's interoperable with any ethernet device that exists and it's full open from that perspective. So this is hardware-level systems. When we go to software, we love open source. So on the Spectrum-X Ethernet infrastructure, we are supporting the SONiC operating system, and SONiC operating system, it's open, it's open source supporting system. And the reason that we love standards and we love open source is because this is the way that our customers can innovate on top of the technology that we created. So obviously the advantage is how you design, how you build it inside, how you configure this stuff. But everything is based on open source, everything is based on standardizations. Even when we look on NVLink Fusion. NVLink Fusion, the interface of NVLink Fusion to connect to GPUs or connect to XPUs or connect to CPUs, it's based on an open interface. So you can choose to connect to it or you can choose to connect to anything else.
Dave Vellante
>> Okay. I know you don't want to bash the competition, that's not NVIDIA's style, but I'm inferring that legacy ethernet or traditional ethernet, if I can call it that, it wasn't specifically designed for AI workloads. It happened to be very good for the internet, but how do you think about, and of course, ultra ethernet, the consortium would say that fits well with AI, but can you help us understand in practical terms, your perspective as to why AI fabrics need something different and specifically your approach, what it means for enterprises that are trying to put in infrastructure that's going to be there for the future?
Gilad Shainer
>> Yeah. So first, regarding your comments on consortiums, by the way, NVIDIA is part of the Ultra Ethernet Consortium. We're to join any consortium that exists, happy to participate, happy to help the ecosystem. The ecosystem is important. And of course, if we will see good technology being developed, being specified, happy to use it, happy to bring it to NVIDIA. So we are part of the ecosystem and we are part of those consortium obviously. Now your question about AI, so AI is distributed computing. It's a classical example of distributed computing. It means that it's a workload that needs to run across many compute entities. Okay? The traditional off-the-shelf ethernet, off-the-shelf ethernet was not designed to support distributed computing. It was designed for single server workload for single CPU workloads. See, if you look on the design of off-the-shelf ethernet, it aims to move data in and out of a server, but it does not deal in a right way if you need to have the data to be distributed across multiple compute engines and be completely synchronized between them. Okay? When you break a problem across multiple compute engines, you want those compute engines to start to compute the same time, to finish the compute the same time and then go to the next phase. If one is late, then everyone else is waiting for it. So again, the example of a hundred thousands of GPUs and 99,999 finish the same time, but one is late, every GPU is waiting, and that's really expensive. So we needed to bring in an infrastructure that eliminated the jitter. That means that every GPU will get the data at the same time. And for that, we actually needed to create a new kind of implementation of ethernet. And we focused on lossless fabrics, which means that in off-the-shelf ethernet, if there is any contention on the network, the easy path was to drop data, to drop packet and actually retransmit it again, but in distributed computing workload, you cannot afford that retransmission, so you don't want to drop back. So we brought elements like lossless and we brought elements like full adaptive routing to distribute the fabric. We built an infrastructure because one thing we understand very in the beginning is that you cannot solve the networking problem for AI on a single device. You cannot solve everything on a switch. You need the switch to work together with a SuperNIC. Some functions will be done on the SuperNIC, some functions will be done on the switch, but it behaves as a single end-to-end infrastructure. Okay, those are the elements that made Spectrum-X Ethernet a great solution for scale out AI. And today, Spectrum-X Ethernet connecting the largest amount of GPUs running single job that no other ethernet version was able to achieve. Now, for enterprises, obviously what we did with Spectrum-X, give them the great solution because on one side, it's running ethernet. So all of the software ecosystem that they have built and invested over the years is maintained, but they get the right infrastructure, the right ethernet infrastructure to support AI workload so they can maximize their investment in building those AI data centers.
Dave Vellante
>> Okay. And you're talking about a classical distributed computing problem, you get at scale, you get diminishing returns, and if I understood you correctly, you've solved that problem. I wonder if you could, I'll go back to my earlier question. You didn't really answer my lock-in question, so I want to ask it again. And I tell you, I've been researching this. I've been an analyst for many, many decades, and I would say about 15% of the customers that I talked to truly care about lock-in and would make a business decision to avoid lock-in versus getting business value. 85% would say, "I'll take business value over fear of lock-in any day." Are you suggesting that A, you adopt open standards where it makes sense and B, I'm inferring, I don't want to put words in your mouth, but if you have a better solution, people are going to gravitate to it because it has greater business value. How do you address the lock-in concern?
Gilad Shainer
>> Yeah, so first, it's not that we adopted open source. Actually, we have been contributing to open source throughout the years. NVIDIA is one of the leading contribution, for example, in open source SONiC, which is the operating system that runs on ethernet switches. Now for your question, people can choose any element that we build and they can connect it to any other element that they choose. We have customers that choose to use our GPU, but they use that with different kinds of connectivity, and sometimes it's their own connectivity. There is customers that choose to use our infrastructure, but they use it with other XPUs or other compute elements. There is nothing that prohibits our customers to pick elements and to connect them to any other element. Every device that we have built has standard interface to it. Our SuperNIC can connect to any PCI Express device, for example, which means it's any CPU, any XPU, any GPU that exists in the market. Our switch has internet connectivity to it. You can connect it to any other internet switch, you can connect it to any other internet NIC. NVLink Fusion is connected to a standard interface, so you can create to other XPUs and XPUs can connect to another scale-up infrastructure if you want to. So there is a full flexibility. You can take any piece of that building block and any piece in that building block is standard interfaces that works at current tools, open storage, and there is a lot of open source software that runs on top of everything that we built.
Dave Vellante
>> Okay. So the entries into and exits from the NVIDIA ecosystem are published, they're known; in that sense, they're open. This is NVLink Fusion, is that correct? I remember the announcement, that was very exciting. MediaTek, Marvell, Qualcomm, Astera Labs, I love Astera Labs, Synopsys, Cadence, on and on and on. And they can bring in, whether it's their CPUs, GPUs, NPUs, XPUs, let's call them any accelerator, into that NVIDIA GPU fabric. Is that the right way to think about it?
Gilad Shainer
>> Yeah, that's correct. So NVLink Fusion is built in a way that you can connect it through a standard interface to any accelerator that you wish or any CPU that you wish, and you can get connected to an XPU, to a CPU, or to both of them. You can choose any way. Now, if you want to connect to another scale-up fabric, it's the same interface, that same standard interface that you have on your accelerator. The reason we brought NVLink Fusion is that we understand how complicated it is to build scale-up infrastructure, and we wanted to share what we have built over the years with other people who can enjoy it and leverage a great technology. And this is why we offered NVLink Fusion, and we made sure that NVLink Fusion has a standard interface to connect to. So you can choose NVLink Fusion, or in the future you can choose any other scale-up infrastructure that you wish. The same openness covers our SuperNICs, covers our Ethernet switches, covers everything that we build in the infrastructure.
Dave Vellante
>> Got it. Okay. I want to talk about business value, cost efficiency, ROI. There's a lot of talk about ROI being elusive in AI. We all know about Moore's Law; not as many people know about Wright's Law. A lot of our research is based on Wright's Law, which we believe is very important in semiconductor manufacturing. But Jensen's law is now sort of in vogue, which is: you buy more, you make more. So I want to ask you, the AI wave right now is like running on one cylinder, and that cylinder is CapEx from hyperscalers and neoclouds and service providers. But where do you see the big gains? If you saw it, maybe you didn't see it, MIT just came out with a study, everybody is talking about it, about how all the AI initiatives are going to fail, which I don't think is true. I think it's all learning. But where do you see the big efficiency and ROI gains coming in AI generally, and specifically from purpose-built scale-out networking like Spectrum-X and scale-up, which is a lot of the heritage, versus some of the other off-the-shelf stuff we talked about? Where do you see that big payback?
Gilad Shainer
>> Yeah, so when we focus on the infrastructure, obviously you can choose different kinds of infrastructure. You can take off the shelf, or you can take something that was purposely built for AI, but the impact of the infrastructure on the performance of your data center is great. The infrastructure actually determines whether you are going to have a server farm, let's say, or an AI supercomputer. It's like vehicles: you can decide to take a minivan to a Formula 1 racetrack. Is it going to drive there? Yes, it will. Do you have a chance to win the race? Probably not. Now, because there is a good amount of investment when you build an AI factory, the network, if you have the right one, is actually free. When you think about it, the impact of the infrastructure on the performance of the data center is much greater than the cost of that infrastructure. And that's the difference between taking off-the-shelf Ethernet that was not purposely built for AI versus taking Spectrum-X Ethernet that actually was purposely built for AI. It gives you all of the Ethernet interfaces, so you can leverage all the investment that you've made in the software ecosystem, your management software, your access software, everything, but you make your AI investment much more efficient. That's the difference. That's why we did Spectrum-X, because, by the way, using off-the-shelf Ethernet for AI factories would make AI factories much, much more expensive.
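The "right network is effectively free" argument is essentially arithmetic about utilization. Here is a minimal sketch with entirely hypothetical capex and utilization figures (none come from the conversation) showing how a more expensive purpose-built fabric can still lower the cost per unit of useful compute:

```python
# Hypothetical illustration: the fabric is a small share of AI-factory capex,
# but it determines how much of the GPU investment actually does useful work.
gpu_capex = 900_000_000  # assumed GPU spend for the facility

fabrics = {
    "off-the-shelf Ethernet": {"network_capex": 50_000_000, "gpu_utilization": 0.60},
    "purpose-built fabric":   {"network_capex": 90_000_000, "gpu_utilization": 0.90},
}

for name, f in fabrics.items():
    total_capex = gpu_capex + f["network_capex"]
    # Capex per unit of *useful* compute scales with 1 / utilization.
    relative_cost = total_capex / f["gpu_utilization"]
    print(f"{name}: total capex ${total_capex/1e6:,.0f}M, "
          f"relative cost per useful compute unit {relative_cost/1e9:.2f}")
```

Under these assumed numbers, the pricier fabric delivers useful compute roughly 30% cheaper, which is the sense in which the right network "pays for itself."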
Dave Vellante
>> I know we're talking about NVLink and Ethernet. Can I ask you about Quantum-X?
Gilad Shainer
>> Sure.
Dave Vellante
>> So if I understand it, you position InfiniBand as the flagship for supercomputing clusters. I've always loved InfiniBand. I wrote a piece many years ago, "Larry Ellison Loves InfiniBand." You remember, he was connecting at high speeds with Exadata, and of course Mellanox, a fantastic acquisition you guys made. But talk a little bit about Quantum-X and InfiniBand, the future of InfiniBand and where you see that going.
Gilad Shainer
>> Yes. Yeah, InfiniBand is great and Spectrum-X Ethernet is great. We develop both scale-out infrastructures and we offer both, and you can choose what makes sense for you to use. Now, InfiniBand, or Quantum-X in that sense, was designed from the ground up for distributed computing. That was the purpose of it, and it's a great technology for distributed computing. As an example, you can check the recent TOP500 supercomputers list: InfiniBand connects more than 270 supercomputers on the list, which is actually the highest number InfiniBand has ever connected on that list. It's a great example of the benefit of InfiniBand: it is the gold standard for scale-out infrastructure for distributed computing workloads, whether that's HPC, scientific computing, or AI, and we have many, many customers that use Quantum-X InfiniBand, enjoying its performance and leveraging all of its benefits for running their AI workloads.

With Spectrum-X, the way we built it is that we took as much as we could from the implementation of InfiniBand into Ethernet. We took the losslessness into Ethernet, we took the RDMA operations into Ethernet, we took the way we do load balancing and congestion control into Ethernet. That implementation enabled us to build Spectrum-X Ethernet as the best Ethernet for AI, purpose-built for AI. Now, as we move forward, we continue to develop elements covering both Quantum-X InfiniBand and Spectrum-X Ethernet. The next generation that is coming is going to be with co-packaged silicon photonics. That's a great next phase of infrastructure: incorporating optical engines into switch packages to reduce the number of components in the data center, increase its resiliency, reduce power consumption, and improve scale, enabling more efficient data centers. That's the next phase, and we're bringing it in as Quantum-X Photonics on the InfiniBand side and Spectrum-X Ethernet Photonics on the Ethernet side.
Dave Vellante
>> So I told you I was going to keep you longer than I said, and I promised a couple of things. One is you had an announcement today, NVIDIA did, with Fujitsu to co-design FugakuNEXT, the successor to the well-known Fugaku supercomputer. So congratulations on that announcement. I think it came out today, or maybe it was yesterday. Today's the 22nd. Yeah, it was yesterday. And then you mentioned photonics. As you grow into this gigawatt era of AI, power is super important, and optics are like a hidden tax. So I wonder, how do co-packaged optics and Spectrum-X Photonics affect the economics of scaling out when we hear about millions of GPUs being connected? I wonder if you could address that.
Gilad Shainer
>> Yeah, yeah, of course. Happy to. So first, obviously before you go to optics, you want to maximize copper.
Dave Vellante
>> Yeah.
Gilad Shainer
>> You want to maximize the usage of copper because copper is essentially zero power. It's very reliable, it's very cost-effective, right? So you want to maximize copper, and this is why we're focusing on copper for scale-up. We brought in elements like liquid cooling technologies in order to increase density in the rack, so we can connect more GPU ASICs over copper, because that's the best connectivity. But when you go to scale-out, you cannot use copper because of the reach, and this is where optics is the way to connect that infrastructure. Now, as AI data centers increase in size, and we increase the bandwidth per GPU on the scale-out infrastructure, the power that the optical scale-out network consumes starts to be a big number. Today, it can go to almost 10% of the compute power. We all know that power is one of the limiting factors when building AI factories, and the more power you can save, the more compute you can bring in and the better you can leverage the power capacity that you have. For scale-out, the way to optimize power is essentially to take the optical connections to the next level. Today, optical connections run on pluggable transceivers, which means the optical engine sits on a transceiver, somewhat far from the switch itself, so you need to invest energy to drive the signal from the optical engine outside the box all the way to the switch package inside the box. By moving that optical engine from outside the box to inside the box, you can save a lot of energy. You can reduce the power consumption of that scale-out optical infrastructure by 3.5X. And that means that at the same power, ISO power, for the scale-out infrastructure, you can connect 3X more GPUs. That's how important it is.
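The ISO-power arithmetic behind that claim is straightforward. The sketch below uses only the 3.5X power-reduction figure from the conversation; the per-port wattage, ports per GPU, and power budget are hypothetical placeholders. In this simple model the GPU gain equals the full 3.5X, while the roughly 3X Shainer cites presumably reflects real-world overheads:

```python
# ISO-power comparison: with a fixed power budget for the scale-out optics,
# cutting per-port optical power lets you connect proportionally more GPUs.
PLUGGABLE_W_PER_PORT = 17.5                  # assumed pluggable-transceiver path power
CPO_W_PER_PORT = PLUGGABLE_W_PER_PORT / 3.5  # 3.5X reduction cited for co-packaged optics
PORTS_PER_GPU = 1                            # assumed scale-out ports per GPU
OPTICS_POWER_BUDGET_W = 3_500_000            # assumed fixed optical power budget

gpus_pluggable = OPTICS_POWER_BUDGET_W / (PLUGGABLE_W_PER_PORT * PORTS_PER_GPU)
gpus_cpo = OPTICS_POWER_BUDGET_W / (CPO_W_PER_PORT * PORTS_PER_GPU)

print(f"GPUs connectable with pluggable optics: {gpus_pluggable:,.0f}")
print(f"GPUs connectable with CPO:              {gpus_cpo:,.0f}")
print(f"Gain at the same (ISO) optics power:    {gpus_cpo / gpus_pluggable:.1f}x")
```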
Dave Vellante
>> Wow.
Gilad Shainer
>> And this is where CPO is very key and important. It's a key element, and we are bringing our CPO into the 200-gig SerDes generation for both Quantum-X Photonics and Spectrum-X Ethernet Photonics.
Dave Vellante
>> Amazing. I'm so lucky that John Furrier, my business partner, was golfing today and I got a chance to interview you. This is fascinating. You keep bringing up topics and I keep thinking about more questions. You brought up liquid cooling. I wonder if you could comment on the state of direct liquid cooling and things like secondary fluid networks and the connection integrity of the boring but important hoses. It's come a long way, but I remember attending some supercomputing conferences a few years ago and looking at the mesh of hoses, thinking, "Wow, this is not very elegant." And now I see them today and they're much more beautiful. Is the liquid cooling ecosystem where you want it to be? Are you happy with the progress that's been made? Can you comment on that?
Gilad Shainer
>> Yeah, we work in an ecosystem, and we have great ecosystem partners across everything we build: ecosystem partners for the co-packaged optics and photonics, ecosystem partners building the liquid-cooled elements that we need, and so forth. Obviously, things continue to progress, there are new elements being created, and we're exposed to new technology, which is just amazing. Just amazing. We're focusing on liquid cooling for the compute generation that we're driving out now, and obviously it's going to be an integrated part of everything else we do. We are enabling 100% liquid cooling on the infrastructure: we build the scale-up and scale-out switches based on liquid cooling, driving 100% liquid cooling there, and it's important for us to make it easier for people to consume and easier for people to build. That's why we share the same liquid-cooled racks for the compute devices as well as the networking or infrastructure devices, and we try to make it very simple to consume, to enjoy, and so forth. So obviously, there are more things to do, right? It's never-ending; there is no endgame here. We don't see an endgame, and this is why it's so fascinating to work in this area. But yes, I think we're very happy with where we are right now, and we also see the things that will need to be developed for the coming generations, and this is where our focus is.
Dave Vellante
>> I have one follow-up and one last question, then I'll let you go. You said 100% liquid cooling. Today, I know a lot of the training is being done on servers that have a hybrid of air and liquid. Do you see that as a bridge, that we're ultimately going to be 100% liquid cooled, or do you see that hybrid persisting? I presume it's use-case dependent, but for the really demanding AI workloads, will those eventually be, and maybe they already are, 100% liquid cooled?
Gilad Shainer
>> Yeah, so obviously if you look back, the majority of data centers were air-cooled, and then they started to migrate to hybrid and then to fully liquid-cooled data centers. And it depends on the target: what data center you build, what size the data center is, what facilities you need to use, and so forth. Obviously, liquid cooling is the right path. It's a good path to increase compute density. It's the right path to reduce power consumption and actually increase the capability of the data center. It's the right path to silicon photonics, the integration of photonics into the switch infrastructure, and so forth. Therefore, we're probably going to see more and more liquid cooling used in the new data centers that are being built and for future technologies.
Dave Vellante
>> That's amazing. All right, last question. Just zoom out, give us your telescope, long-term vision. If you think about five years, I know that's a long time in this era, but what should we expect AI infrastructure, compute infrastructure, to look like? Is it going to converge around a handful of purpose-built fabrics, or do you think we'll see a hybrid coexistence of some off-the-shelf technologies that maybe weren't purpose-built, maybe some emerging standards? What do you see as the endgame here, or the steady state? There really is no endgame, as you just pointed out.
Gilad Shainer
>> Yeah. When you talk about an outlook of five years, it's a very interesting question, and the reason it's a very interesting question is because I would like to know that too. If you had looked forward five years ago, you could not have imagined where we are today, right? You completely couldn't have imagined it. It's like the world has changed seven times in that sense. The amount of technology created in the last five years is amazing. So if you ask me about the next five years, it's a really interesting question.

First, I can tell you what we're focusing on now. Right now, it's co-packaged optics on one side; we're looking at the next generation there. And we just announced Spectrum-XGS, Spectrum-X for giga-scale Ethernet, and that becomes important because now, even when you maximize your compute in a single data center, it's not enough for the workloads that you want to run. We actually see the need to connect multiple data centers together to form gigascale AI factories to support the next generation of workloads, and for doing that, you need to start running over long distances. So we talked about scale-up, we talked about scale-out; now we have scale-across. We have a new infrastructure being created to support data-center-to-data-center connectivity, to enable AI across data centers. This is what we announced now, and we announced that CoreWeave is the first customer implementing Spectrum-XGS Ethernet to connect multiple sites together to form gigascale AI factories. Those two things are going to be core elements of the next generation of infrastructure.

Looking out five years, AI is growing at an amazing pace. Every data center is going to be accelerated. You want to maximize your capability in every data center, and therefore you will see purpose-built, but industry-standard-based and open source, infrastructure supporting those data centers. And then there's the cadence of the technology. Today, it's an annual cadence: every year you see a new infrastructure, new switches, new NICs, new accelerators being built. So in five years, we're talking about five generations. This is why I said it's a very interesting question.
Dave Vellante
>> We call it moving at the speed of NVIDIA, and the industry is doing that. When are you going to solve the speed of light problem, Gilad? Are you working on that?
Gilad Shainer
>> That's an easy task, that was already solved.
Dave Vellante
>> Right. Hey, this has been an amazing conversation, Gilad. Thank you so much for taking the time. It was really a pleasure having you.
Gilad Shainer
>> Yeah, thank you for having me, and see you next time.
Dave Vellante
>> Yeah, would love to have you back. And thank you for watching this CUBE conversation. This is Dave Vellante for theCUBE, and we will see you next time.