SC24 | Keynote Analysis, Day 2 with Hasan Siraj

Hasan Siraj

Head of Software Products, Ecosystem

Broadcom

Dave Vellante

Co-Founder & Co-CEO

SiliconANGLE Media, Inc.

John Furrier

Co-Founder & Co-CEO

SiliconANGLE Media, Inc.

Savannah Peterson

Principal Analyst & Host

SiliconANGLE Media, Inc.

From servers to clusters: theCUBE takes a look at how Broadcom is using AI to reshape IT infrastructure

Clustered systems are emerging as the backbone of artificial intelligence infrastructure, transforming how industries manage the immense demands of training large language models and executing machine learning tasks. As traditional computing frameworks fall short, these interconnected networks of servers are driving a new era of efficiency and scalability. However, the success of these systems hinges on strong networking solutions, which serve as the critical foundation for ensuring seamless operations and unlocking AI’s full potential.

theCUBE's live coverage in Atlanta for Supercomputing 24 features John Furrier and Dave Vellante. Hasan Siraj, head of software products, ecosystem at Broadcom, discusses the AI factory and networking's role in supporting generative AI workloads. Hasan highlights the importance of networking in clustered systems for AI, emphasizing the need for robust networking to prevent infrastructure inefficiency. The conversation explores the challenges of AI workloads, such as bandwidth intensity, large flows, and failures in optics. Hasan discusses Broadcom's innovations in ne...

>> Welcome back everyone to theCUBE live coverage here in Atlanta for Supercomputing 24. I'm John Furrier, host of theCUBE with Dave Vellante, my co-host. Check out our podcast. We co-host the podcast called theCUBE Podcast every Friday. Check it out. Hasan's back on theCUBE. A CUBE alumni, he's the head of software products, ecosystem at Broadcom, here to unpack the AI factory, the networking role in it, and all the important considerations to look at when you think about rethinking and re-architecting your infrastructure to support generative AI. The workloads are in demand. People are busy laying down the lines on what's going to be the future runtime of large-scale systems. Hasan, great to see you. Dave, looking forward to this.

Hasan Siraj

>> Thank you John.

>> All right, so we had Dell on earlier, we talked about the AI factory, but there's so much going on. Broadcom has really been leading in the area of enabling this kind of next-gen system. I call them clustered systems. I wrote a post on SiliconANGLE called We Are Now in the Era of Clustered Systems and the thesis was that the server connected on a rack in the old days is now a system of servers, a system of networking, a system of components and all kinds of componentry, the chips that you guys make. NICs, SmartNICs, ethernet, all working in concert together, and it's large-scale supercomputing capabilities. Basically democratizing supercomputing for the masses, which is essentially what servers did in the old client-server days and then became the internet. So this is a key area and people are making choices, Hasan. So let's start with what your vision is for the clustered systems and how the role of networking fits in, because this is an important part. It may not be the big number where the cost is, but there are consequences if you don't get networking right in the AI paradigm.

Dave Vellante

>> Yeah. Who is that?

Hasan Siraj

>> John, great question. So I completely agree with this point around clustered systems and if I go back, we work with a lot of the hyperscalers to build the clouds that are out there, networking for that. And when we were doing this, we heard a lot of things around, hey, networking is getting in my way, how do I optimize the server utilization in that space? But go back three years ago when everybody was starting to approach these large language models and they were trying to build these clusters, they came back with the same question, that networking is getting in my way. And this is because AI is a fundamentally different workload and this is a fundamentally different problem compared to cloud computing. Let's just look at training for example. If you are training a large model, and these models are growing at an exponential rate, they don't fit in a CPU and a core of a CPU. Virtualization is not in play, they will not fit into tens of thousands of cores of a GPU.

>> Hundreds, thousands, maybe tens of thousands of GPUs are required to do that. This is why you cannot fit a model within a server or two servers or four servers, and that is why you need a cluster. And guess what? When you have a cluster and everything is spread out, you need glue to put this all together and that is networking. And if you don't have robust networking to put all of this together, things are going to fall apart. You may have spent a lot in this very large infrastructure, but you would not be able to utilize it effectively.

>> Yeah, I mean everyone focuses on the costs and they are expensive, a rack could be very expensive. You can have a $4 million rack deal, whatever the numbers you sized up, they're not small. But the payout is great if you get the workloads running right because it's a game changer, but it's an extinction event in IT, a critical failure, if networking fails, because latency matters, connectivity matters and just tapping into the resource. I mean it is the glue. This is a critical infrastructure component of the clusters. It's a make or break architecture. I want to amplify that because I think it gets overlooked a lot in the larger scope, but certainly if you're designing it, you know it. So I want to unpack where's the innovation coming from? I know you guys are working hard on that at Broadcom, but where's the innovation from on Broadcom's side and where does that translate into the practitioner who's building it? Because you got architecture that has to handle choice, whether it's workload choice, routing models, making sure that I'm tapping into the right machine, one that has the liquid cooling, one that has air cooling. There's all kinds of system component decisions. It's like an operating system. I got to know what to do.

Hasan Siraj

>> So you've got to innovate at all levels of the stack. We build network silicon, we got to do a lot of innovation there. There's innovation on the system level. There is innovation that needs to happen on the software side. But let's unpack, to your words, let's just go back. I mean Meta talked about this a few years ago, that look, when I am building these clusters, I'm running these recommendation models. If you're browsing Facebook, you're getting recommendations for ads. That's where they use the recommendation models. Up to 57% of the time was being spent in networking. It's a very expensive investment that they've made and they want to utilize it to the fullest. Now why does this happen? You really need to understand how machine learning works at a very high level in order to do that. I mean in machine learning you're doing a lot of computation, you're doing a lot of matrix multiplication, then you spit out a lot of gradients and weights which the GPUs exchange with each other. And what happens after that's done? You synchronize and then you start the cycle again. You keep going till you get the error levels that you're looking for. Now in this process, first of all, this is a hugely bandwidth-intensive process. Servers, to your point, have been at like 25, 50 gig if you're pushing it, and up till now that has worked very well. But over here we are talking about 400 gig, moving to 800 gig and in a couple of years, 1.6 terabit ports. So extremely high bandwidth. The second thing is the flows in this case are what we call elephant flows. You don't have a lot of flows.
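The compute, exchange, synchronize cycle described here can be sketched in a few lines. This is only an illustrative toy, not how any particular framework or Broadcom product works: the one-parameter model, the `SHARDS` data, the learning rate and the function names are all assumptions made for the example, and the "exchange" is a plain in-process average standing in for a gradient all-reduce across GPUs.

```python
# Toy data-parallel training loop (illustrative assumptions throughout).
# Each "worker" holds a shard of (x, y) pairs for the model y = w * x.

def local_gradient(w, shard):
    """Gradient of mean squared error on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the network exchange: every worker ends up with the mean."""
    return sum(values) / len(values)

def train(shards, lr=0.01, tol=1e-6, max_steps=10_000):
    w = 0.0
    for step in range(1, max_steps + 1):
        grads = [local_gradient(w, shard) for shard in shards]  # compute
        g = all_reduce_mean(grads)                               # exchange + synchronize
        if abs(g) < tol:                                         # error level reached
            return w, step
        w -= lr * g                                              # start the cycle again
    return w, max_steps

# Hypothetical shards drawn from y = 3x, one per worker.
SHARDS = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
w, steps = train(SHARDS)
print(f"learned w={w:.6f} after {steps} synchronized steps")
```

Every step ends with an exchange that all workers must finish before the next step begins, which is why the network sits on the critical path of the whole job.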

>> Elephant flows?

Hasan Siraj

>> Elephants.

Dave Vellante

>> Elephant workloads.

>> The elephant in the room is...

Hasan Siraj

>> They are big flows. So think of this as you're going on a highway and you have seven lanes. If you're not utilizing those seven lanes, you're going to have congestion. And in the traditional cloud world with TCP/IP, there are hundreds of thousands of flows. They load balance very effectively, but you have to have very meticulous engineering to make sure that is done right. Then when these GPUs converse with each other, a GPU could be communicating with six others at the same time, and this is what we call incast. You are overwhelming a GPU, and how do you deal with this incast problem? Now these training jobs will run for days. If you're running for days, I think what you'll find is one of the biggest failures when they're building this network, besides the GPU, is optics. Optics fail. If you assume a 2% failure rate, that's about 15 failures per month in a 4,000 node cluster. Now guess what? If you don't recover from that failure very well, you are going to go back to a checkpoint and restart the process. Look what you've done to your training job. So these are the essential things that need to be solved from a networking perspective. How do you give the highest density? How do you give the highest performance? And then how do you load balance? How do you do congestion control? How do you recover from failures? And that is something that we have incorporated in our silicon, in our roadmap. Specifically, speeds and feeds are extremely important and we have been on this cadence of doubling the density and performance every two years. We have done this like clockwork over the last decade. We have a 51.2 terabit Tomahawk 5 right now. You could predict when the 100 terabit is coming, and that will keep happening. Same thing, we have a huge investment in terms of building out AI-specific NICs. We have a 400 gig NIC. You can predict a roadmap to 800 gig and 1.6 terabit.
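The highway picture can be made concrete with a small simulation. It is a sketch under stated assumptions: flows are equal-sized, each is pinned to one lane (link) at random as a stand-in for hash-based load balancing, and the flow counts, trial counts and function names are invented for the illustration. It compares how evenly seven lanes fill up with a handful of elephant flows versus thousands of small flows.

```python
import random

def busiest_link_ratio(num_flows, num_links, rng):
    """Place each equal-sized flow on a random link; return busiest load / ideal even load."""
    per_link = [0] * num_links
    for _ in range(num_flows):
        per_link[rng.randrange(num_links)] += 1
    ideal = num_flows / num_links
    return max(per_link) / ideal

def average_ratio(num_flows, num_links, trials, seed=0):
    """Average the busiest-link ratio over several random placements."""
    rng = random.Random(seed)
    total = sum(busiest_link_ratio(num_flows, num_links, rng) for _ in range(trials))
    return total / trials

LANES = 7  # the seven-lane highway from the conversation
elephants = average_ratio(num_flows=7, num_links=LANES, trials=200)
mice = average_ratio(num_flows=7_000, num_links=LANES, trials=20)

print(f"few elephant flows: busiest lane carries {elephants:.2f}x its fair share")
print(f"many small flows:   busiest lane carries {mice:.2f}x its fair share")
```

A ratio of 1.0 would mean every lane carries exactly its share. With only a few large flows, collisions on one lane are likely while other lanes sit idle, which is the congestion problem that per-flow hashing does not solve on its own.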

>> So hold on, one quick follow up and then I'll pass it to Dave. So you're saying that with training, the job completion is critical. That's the end game. And if that's not happening, that's going to come from the bottleneck in the network, because for job completion, whether you're training or whatever the workload is, the bottleneck in the old days was jitter, latency. That's what you're getting at. Is that what you're saying?

Hasan Siraj

>> You hit the nail on the head. That is the most critical metric that everybody should be looking at, job completion time. You talk about latency. In the case of AI workloads, latency through a switch is not that important. It is when you receive the last message, and that's what we call tail latency. So you want to minimize the tail latency, which will minimize the job completion time. And that can only happen if you have the best load balancing, the best congestion control and the best failover.
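Why the last message matters more than the typical one can be shown with simple arithmetic. The sketch below assumes a fully synchronous job in which every step waits for the slowest message before continuing; the millisecond figures, step count and function names are made-up illustrative values, not measurements.

```python
def step_time_ms(compute_ms, message_times_ms):
    """A synchronous step finishes only when the last message has arrived."""
    return compute_ms + max(message_times_ms)

def job_completion_time_ms(steps, compute_ms, message_times_ms):
    """Total job time when every step pays the tail latency."""
    return steps * step_time_ms(compute_ms, message_times_ms)

def mean(values):
    return sum(values) / len(values)

STEPS = 1_000
COMPUTE_MS = 20

balanced = [10] * 8             # every message arrives in 10 ms
one_straggler = [10] * 7 + [50] # one message delayed by congestion or a failure

jct_balanced = job_completion_time_ms(STEPS, COMPUTE_MS, balanced)
jct_straggler = job_completion_time_ms(STEPS, COMPUTE_MS, one_straggler)

print("mean message latency:", mean(balanced), "ms vs", mean(one_straggler), "ms")
print("job completion time: ", jct_balanced, "ms vs", jct_straggler, "ms")
```

In this toy case the average message latency rises from 10 ms to 15 ms, but the job takes more than twice as long, because each step is gated by the single slowest message rather than by the average.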

>> And that's where the GPUs are excelling. They're actually making the new jobs, which is new category happen faster. So it's not happening, it's not worth it. I mean that's basically critical fail. That's the connection point.

Dave Vellante

>> Hey, let me in. I don't know if you guys saw this in the Wall Street Journal today. It isn't just data centers, AI's plumbing needs an upgrade. I thought they were going to talk about liquid cooling, but it was an article about networking. Now they're not wrong because they talk about bandwidth, but they were talking about Cisco. I mean this is not an old school router and switching problem. What you just described, Hasan, is sort of a new thinking, new era, new bottleneck around networking. So I'm really interested in helping the audience understand the connections within the XPUs. As you mentioned this, you talked about some of the parameters, the weights. So you have compute going through the roof, you have data like this and you have the weights and the parameters, all three kind of scaling together. So the XPUs are talking to each other. And then as you described, tens of thousands, half a million, maybe even someday, not maybe, a million GPU clusters. So different types of networks that you guys are driving that are going to become the dominant. Yeah, we're still going to have switches and routers and those sort of old school internet, John, that you know so well. But this is a completely new refresh of the existing infrastructure and I wonder if you could explain those dynamics.

Hasan Siraj

>> No-

>> And by the way-

Hasan Siraj

>> Very good question.

>> By the way, the article didn't even mention Broadcom, which is at the heart of this. Obviously NVIDIA is there as well.

Dave Vellante

>> It's Wall Street Journal. I mean that's not really and it's media.

>> But it was a great, but their timing was good, right? I mean they're right, they're not wrong, but they don't go deep enough. So I'd love you to do that.

Hasan Siraj

>> No, absolutely. So when you're building the AI infrastructure, we think of it in two ways. We call them a scale-up network and a scale-out network. Now the scale-up network is where, to your point, the XPUs are connected together. There is this desire to have that domain as big as possible. This domain is probably like 64 GPUs today, will probably move to 128, 256 moving forward. The requirements for this domain are slightly different compared to scale-out. Scale-out is like you have this scale-up domain, let's say it's 64 GPUs, but you have 10,000 of these domains and you need to connect all of them together. That is the scale-out network. So we do both. And the scale-out is something that we have talked about a lot, I've described the requirements around it, and ethernet is now becoming the de facto standard for the scale-out network. The largest clusters out there, 100,000 GPUs, are based on this. Meta at OCP announced, leveraging some of our technology, that they have built very large clusters. InfiniBand was an option at this point of time. It may still remain in some cases, but dominantly the world is moving there. We also believe, for the scale-up, which is where you are connecting the XPUs together, ethernet is the right technology. The requirements there are slightly different because over there latency becomes a more important factor. You need to be able to cater to very small packets at very high throughput. You need to have what we call link-level retries. You need to be able to optimize the headers, to utilize the bandwidth effectively.

Dave Vellante

>> Because you're managing asynchronous distances, right?

Hasan Siraj

>> Correct. Correct. But we believe ethernet has the capability to solve that as well. Now NVIDIA has a technology called NVLink, with which they do this, but we believe that just like standards and open have prevailed for scale-out, that is what everybody is looking for in scale-up as well. They don't want to be tied down. Ethernet has the capability to get this done and that will prevail.

Dave Vellante

>> So what changed? I mean four or five years ago it was like, oh, InfiniBand, that's going to be the dominant high performance, high bandwidth ethernet. It's got its place. But now you're talking about ethernet becoming a dominant, the dominant protocol. What changed?

Hasan Siraj

>> I think InfiniBand was being used for a certain size of clusters in HPC environments. It worked very well in those proprietary domains. What has happened is this whole wave around AI and LLMs, especially with hyperscalers, when you have to go to GPT-4 with 1.3 trillion parameters and you've got to train on these extremely large clusters. And to your point, today we are talking about 30 to 64,000 GPU, maybe 100,000 GPU clusters. But a million is not far away. When you are trying to do this, a technology like InfiniBand just cannot scale. It does not have those failover mechanisms. And plus there is no ecosystem. Ethernet, the thing about it is it's standards-based, it's a large ecosystem. There are people who know how to manage ethernet-based networks. There are troubleshooting tools, monitoring tools that are available. People are like, I don't want islands, technology islands. Whenever you're building an AI network, you have a front-end, a back-end, a storage and an out-of-band management network. That's all ethernet. So it's a standard way of managing all.

>> It's funny, in the article, Dave, it says NVIDIA makes a networking platform called InfiniBand, for moving large amounts of data. And then it goes on, it says "Ethernet, a competing platform considered less mature for AI networking," they do that in the mainstream media. They've explained it with the comma, but they say it's less mature for networking. Actually it's been around for longer.

Dave Vellante

>> In TCP/IP.

>> I love how they do that, ethernet comma. For people who don't know what it is and they got it wrong. So InfiniBand isn't more mature than ethernet, number one. It's good for one use case, moving between small nodes. Okay, that's it. I mean that's pretty much it. Yeah, other things, but that's pretty much it. Ethernet is everywhere. It's connected to other systems.

Hasan Siraj

>> Even for the AI, the largest clusters are all based on ethernet, right?

>> Yeah.

Hasan Siraj

>> I mean I think-

Dave Vellante

>> But this is not debatable. Even NVIDIA is doing ethernet. I mean it's not like the market's not-

Hasan Siraj

>> Absolutely, they actually we are all for it, right?

Dave Vellante

>> Yeah, why not?

Hasan Siraj

>> We want, NVIDIA is doubling down on their ethernet roadmap and that's the beauty of ethernet. You have other players out there, and it gives customers the choice, it brings the cost down, it keeps the cost in check. And what we say is may the best execution win.

>> In the evolution and innovation on the chip side, ethernet also has an advantage. I saw Broadcom present ethernet in the actual fabric of the board. Ethernet can be embedded into the system in ways that are more effective in the cluster. I think you guys are building chips in this area where ethernet's in the substrate. So take us through why ethernet is good for Broadcom, but then on the open ecosystem side, how does that extend out to the value? I mean, is it less expensive to deploy? Is it better energy? Take us through where the ethernet fits in, as the chips get better, there's limited space, they come in.

Hasan Siraj

>> So I mean ethernet is, first of all, ahead of any other technology by a generation. I talked about this 51.2 terabit chip. It's at least a generation ahead. And this has been in production since March of 2023. So when you are a generation ahead in terms of performance and density, you can replace six systems with one system. So think of it as you had six previous-generation boxes, you can replace them with one single box, that's about a 75% reduction in power. Imagine what you're saving on space, on cooling. And plus these are very simple systems. You don't need large chassis to do this. You can pack so much. I mean Tomahawk 5, you can build 64 ports of 800 gig, 128 ports of 400 gig. And two
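The port counts quoted for Tomahawk 5 follow from dividing the chip's switching capacity by the port speed. The short check below assumes the 51.2 figure means terabits per second of total switching capacity and that ports are carved out evenly; the function name is just for illustration.

```python
def port_count(switch_capacity_gbps, port_speed_gbps):
    """Number of full-rate ports a switch chip of the given capacity can expose."""
    return switch_capacity_gbps // port_speed_gbps

TOMAHAWK5_GBPS = 51_200  # 51.2 Tb/s expressed in Gb/s (assumed reading of the figure)

ports_800g = port_count(TOMAHAWK5_GBPS, 800)
ports_400g = port_count(TOMAHAWK5_GBPS, 400)

print(f"{ports_800g} ports of 800 gig, {ports_400g} ports of 400 gig")
```

Both results match the numbers given in the conversation: 64 ports at 800 gig or 128 ports at 400 gig.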

>> And the AI fabric you guys have, Jericho.

Hasan Siraj

>> Exactly. So we have a Jericho, which is basically a DDC fabric, and you can build a 32,000 node cluster with just pizza boxes, very small pizza boxes. So it's simple, saves you on power, saves you on cost. And plus, like I talked about, there's a huge ecosystem. The ecosystem knows how to manage this thing, how to troubleshoot this thing and the cost. At the end of the day, the cost of ethernet is absolutely lower than any other technology. And when you have many other players doing this, it's going to make sure that-

Dave Vellante

>> Ecosystem....

Hasan Siraj

>> it's cost-effective.

>> Hasan, great to have you on. Final point before we close, I know we've run a little bit over time. I want you to touch on the ecosystem with the consortium, Ultra Ethernet. This is now more proof points. Give a quick plug for what's going on there. I know you're involved as a collection of open vendors in there and why that's important.

Hasan Siraj

>> Yeah, Ultra Ethernet Consortium, like we discussed, there are clusters with 16,000, 32,000 nodes out there. But as we look forward, there is already talk about, Microsoft has talked about a huge investment. You'll see clusters of half a million, a million nodes. And when you have these, all sorts of challenges come to the table. And when we talk to a lot of people, we're like, "Hey, there are a lot of miracles that need to happen. You focus on your own miracle." Imagine these are not going to fit in one data center. They're going to go across multiple data centers, which are tens or hundreds of kilometers apart. How do you make it all work? How do you manage failover, power, cooling? How do you connect? How do you manage latency? But one important component that is fundamental to any AI workload is RDMA. RDMA is the protocol. And when you run it over ethernet, it's RoCE, right?

Dave Vellante

>> Yep.

Hasan Siraj

>> Which is RDMA over Converged Ethernet. Now, this is two decades old, so this is-

Dave Vellante

>> Mature as you said.

>> That's for sure. It's been around for a while.

Hasan Siraj

>> You know this was-

>> It's got multipath and selective retransmissions. It's got all kinds of out-of-order packet knowledge.

Hasan Siraj

>> Exactly. So RDMA was, yes, useful, but if you want to go to this bigger scale, you need to be able to fix it. You need to bring multipathing, to your point, out-of-order placement, selective retransmit. And these are the kind of capabilities that are being discussed. Now, there are more than 100 members in UEC, so let's talk about the ecosystem. And even NVIDIA is now part of the UEC, which is awesome. So I think that's what is happening. People are standardizing on these implementations. So people have a standard implementation which can allow them to scale down the road. So no matter what size clusters they've gotten.

>> More proof, open source wins, open ethernet is winning. It will be the fabric. Hasan, thanks for coming on theCUBE. And of course we're going to do more in our AI research, our AI labs we're building. So stay tuned for more CUBE after this short break. I'm John Furrier, Dave Vellante with Savannah Peterson and Kristen Nicole here on the ground at Supercomputing 2024. We'll be right back.

>> Easy.