SC24 | James Wynia, Dell Technologies & Hemal Shah, Broadcom

James Wynia

Director Product Management, Sr Product Manager, Sr Analyst, Networking

Dell Technologies

Hemal Shah

Distinguished Engineer & Architect

Broadcom

Dell and Broadcom outline plans to drive scalable AI networks at SC24

The high-performance computing industry is rapidly advancing to meet the demands of artificial intelligence and machine learning workloads. At SC24, there’s been a major focus on scalable AI networks and what comes next.

There have been plenty of developments on that front over the past year. It involved a focus on networking and how to build a big fabric given that AI and machine learning is all about scale, according to Hemal Shah (pictured, left), distinguished engineer and architect at Broadcom Inc.

Hemal Shah and James Wynia talk to theCUBE during SC24.

“Dell and Broadcom with our other partners, we are working to build really high bandwidth, high network utilized fabrics,” Shah said.

High performance computing experts in Atlanta, Georgia are discussing the scale of artificial intelligence and machine learning, focusing on networking. Dell and Broadcom are collaborating to build high bandwidth fabrics for AI and ML with a validated design for simplified deployment and monitoring. Large scale fabrics with low latency and scalability are crucial in AI and machine learning. Dell's PowerSwitch, along with Broadcom, is vital for building AI clusters. Ethernet, led by the Ultra Ethernet Consortium, is essential for networking offering resiliency...

Savannah Peterson

>> Good afternoon, high performance computing fans, and welcome back to Atlanta, Georgia. We are here coming to the conclusion of day three of our three days of coverage on theCUBE. It has been a fantastic week. My name's Savannah Peterson, I'm here with John Furrier. I love doing this show with you every year.

>> Yeah, this show is all about the technology and show me the evidence that the scale could be there with artificial intelligence, machine learning just exploding on the scene. The demand for high performance, scalable fabrics networking is huge. And we've been pointing out networking is the key. This topic will be great. This segment should address all of that.

Savannah Peterson

>> Well, it doesn't work if the networking doesn't work, so it's absolutely imperative. Jim and Hemal, thank you so much for taking the time to come hang out with us.

>> Our pleasure to be here. Absolutely.

Savannah Peterson

>> It's probably been a busy week for you two.

>> It is, in a good way. In a good way. Lot of excitement here.

>> Yeah.

Savannah Peterson

>> Yes. And I feel a little bit like you saying that I feel like what we do now is a little more mainstream. We used to kind of be like the cool nerds in the corner talking about stuff, and now I feel like everyone's kind of interested in what we're doing.

Savannah Peterson

>> That's right.

>> Yep.

Savannah Peterson

>> Let's talk a little bit about the partnership. Y'all have been working together for a while.

>> A decade.

Savannah Peterson

>> Well over a decade.

Savannah Peterson

>> Yeah. I was going to-

>> I was thinking I can remember at least 15 years myself and I'm sure it went before that, so-

>> Yes, or it's about two decades.

Savannah Peterson

>> So not a casual amount of time.

>> No.

Savannah Peterson

>> We've obviously gone through a lot of different innovation cycles, hype curves, everything else. Here we are in AI/ML workload land. Talk to me about what you two are working on specifically right now. And Hemal, I'll start with you.

>> So yeah, on the AI ML last year we talked about HPC. We were working on the networking, how we build this big fabric because AI ML is all about scale. And so Dell and Broadcom with our other partners, we are working to build really high bandwidth, high network utilized fabrics. And in partnership, what we bring together is a lot of the software integration, the whole diagnostic monitoring of the fabric, which makes life easy for deployments. So that's what we have been working together on for a while, and now we have a validated design, which is already public. Congratulations.

Savannah Peterson

>> Congratulations. It's a big deal. We're at an era where we're not just going up incrementally, but real orders of magnitude in terms of scale. How are you managing and collaborating on those performance requirements? I'll turn to you, Jim. Just kick us off.

>> Absolutely. And we're so happy to be here. We had so much fun last year when we were here talking about this, as Hemal said, and-

Savannah Peterson

>> Well, that's exactly why we made sure you came back again. We love fun people.

>> Yes. And we predicted that when we came back we would be talking about 800 gig, and that's definitely one of our topics today. But yeah, the large scale fabrics are critical. I'll start by just saying one rough diagram that we have, which is based off of a great key collaboration that we have where we run, say, the Tomahawk 5 ASIC in our new switch, the Z9864. We've worked very closely with that. We launched that about a quarter ago, and we're saying the sales are going like that. And running on top of that switch is the OS called SONiC, which is based off of open source, open standards, and we also collaborate with Broadcom on that. And then to manage that, management being so key, we have SFM, SmartFabric Manager, and so that brings the whole picture in terms of the latest NPUs, the latest software developments based on open standards and the latest fabric management under one solution.

>> Talk about the ethernet role of networking and the AI fabric, specifically the low latency and the scalability are two areas people are talking a lot about. What have you guys done there this year? What's different? Because you're starting to see the AI factory come into visibility, get the whole factory here in a truck and the booths showing the new products. You brought the steak to the party, as they say, sizzling the steak. What's the-

Savannah Peterson

>> You're making me hungry.

>> What is-

Savannah Peterson

>> This is the second time?

>> Talk about the advancements on the AI fabric and the ethernet piece, because latency and performance.

Savannah Peterson

>> Yeah, so-

>> You going to take first up?

Savannah Peterson

>> Yes.

>> Jump off.

>> We'll start with the fabric. So latency and performance is already there. I brought some of the things-

Savannah Peterson

>> I was just going to say where you show us those toys.

>> So this is 51.2 terabits per second, Tomahawk 5, which is in production with Dell switches.

Savannah Peterson

>> May we take a look at that?

>> Sure.

Savannah Peterson

>> Can I touch it?

>> Yes.

>> You have to give it back.

Savannah Peterson

>> Yes. I mean what? No, I'm kidding. I'll hold this up for the guys here to get a little ice if you'd like so we can see what we're working with. That'll even get-

>> That's the Dell PowerSwitch, right? The Dell PowerSwitch?

>> Yeah.

>> All right, good.

>> So let's start with that. That's the fabric component. And then you are going to build two-tier, three-tier networks with these ASIC switches in the middle. These are high-radix switches. They have a lot of telemetry features which help the congestion control. And also in the network now, as you go to large scale, we have so many paths. So these switches have intelligent load balancing algorithms, which allow you to utilize the network best.
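
As a rough illustration of the two-tier and three-tier scaling mentioned above, the sketch below estimates how many endpoints an idealized, non-blocking fat-tree (folded Clos) fabric can reach from a single switch's port count. The 51.2 Tbps figure comes from the conversation; the 800G and 400G port speeds, the function names, and the assumption of identical switches with no oversubscription are illustrative, not a description of any Dell or Broadcom validated design.

```python
# Back-of-the-envelope fabric sizing.
# ASSUMPTIONS (not from the interview): an idealized, non-blocking fat tree
# built from identical switches, with half of each switch's ports facing down
# and half facing up. Real deployments often differ, so treat results as
# upper bounds.

def switch_radix(switch_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch can expose at a given port speed."""
    return switch_gbps // port_gbps


def max_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints in a non-blocking fat tree: 2 * (radix / 2) ** tiers."""
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    SWITCH_GBPS = 51_200  # 51.2 Tbps, the switch capacity quoted above
    for port_gbps in (800, 400):  # illustrative port speeds
        radix = switch_radix(SWITCH_GBPS, port_gbps)
        for tiers in (2, 3):
            endpoints = max_endpoints(radix, tiers)
            print(f"{port_gbps}G ports, radix {radix}, {tiers} tiers: {endpoints} endpoints")
```

Under these assumptions a 64-port switch reaches 2,048 endpoints in two tiers and 65,536 in three; real fabrics usually trade some of that reach for oversubscription, rail layouts, or spare uplinks.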

Savannah Peterson

>> Exactly. I know, I was just going to see if you're talking.

>> That'll be on eBay in about 10 seconds.

>> So this is one kind, the endpoint-scheduled fabric you build, but some customers want to build, this is the Jericho3-AI. So where the switch itself has all the congestion control and transport logic in there, and your connectivity to there is 400 gig, 800 gig. So here is another one. So these are two different types of fabrics you build on the front end. That's one point. And then there is the NIC, which is providing you the RDMA features. 400 gig now, 800 gig in the future. So together, all of this comes together to give-

Savannah Peterson

>> I'll hold it up, sorry....

>> fabric, and then all the software you put around and integrate with the components from Broadcom, from the GPU vendors, as well as Dell, that makes the whole solution stack how you run the family.

>> And the routing piece is still there? Adaptive routing still features?

Savannah Peterson

>> Adaptive routing is the feature of the switches.

>> Okay. Yeah. That's awesome. Yeah.

Savannah Peterson

>> So cool. Did you intentionally make it so reflective I could check my lipstick as you passed it over? That was extremely thoughtful. That is really inclusive in your design over there.

>> We try to please.

Savannah Peterson

>> Yeah.

>> So Jim, if I'm putting together an AI factory, the PowerSwitch is key to that piece. Right?

>> Absolutely.

>> So I got the clusters.

>> Yeah.

>> PowerSwitch with the Broadcom, has the fabrics, compute, GPU, storage, fabrics, all kind of built in. Take me through the role of that switch.

>> Yeah, you're absolutely right and we all realize that there are options out there, but really where you want to end up is a completely certified validated solution so that you take the guesswork out. This is really going to work. If you're plugging in Legos here and there, then you're rolling the dice, and if I want to shoot craps, I'm going to go to Vegas.

>> By the way, GPUs, they can't be wrong. They got to be right all the time because they got to be working. They're producing job completion.

>> They're very expensive.

>> The KPI everyone's talking about here. So no job completion, the GPU's going to be out of cycle, all kinds of inefficiencies.

Savannah Peterson

>> Right, right. So that's one of the areas where Dell can bring the full picture of the whole solution to bear, a fully certified solution. Dell is an absolute leader in the server market, absolute leader of the storage market, pushing the boundaries of the fabric that you want more than paperweight. You want these things to talk to each other. That's where-

Savannah Peterson

>> Oh my gosh, yeah. Most expensive paperweight ever, if that's the case.

>> Absolutely. Absolutely. And consuming a lot of power. So getting the right fabric in there is absolutely critical. So that-

>> You've got the advanced networking hardware, you got the Broadcom relationship with the chips and software.

>> That's right.

>> Every year you guys do a great job of raising the bar on performance. What's-

Savannah Peterson

>> Where should we go from here?

>> Yeah. What are you guys working on now? Because this is an engineering show and engineering wins customers. That's what we're seeing here, because again, the areas that we're solving is space, energy and price performance. These areas, people are working hard on this. These are the high stakes of this game, so to speak, a gambling analogy. You miss one of them, you're out.

Savannah Peterson

>> You are out.

Savannah Peterson

>> Single-native.

>> Talk about craps, these are the three hardest areas that everyone in the show's working on. What's the engineering focus?

>> Jump in? Go ahead?

>> Yeah.

>> He's privy to some great data, but a couple areas that I see coming for sure are, we talked about 800 gig last year and yeah, we've released 800 now. Okay, but do you see 800 to the servers today? No. Because of that little NIC thing, we need the next generation. He could talk about exactly when that comes and what that means, but you need that to get to the server. But 800 gig is available in the fabric. We'll double that. They'll talk about that. He'll have a new piece of silicon next year, not to steal your thunder, but really an area that is super exciting is the Ultra Ethernet Consortium. And honestly, Broadcom is leading the Ultra Ethernet Consortium with peers in the industry doing a fantastic job. Dell also participates in that.

Savannah Peterson

>> How big is that Consortium?

>> Oh, that is a good question. Do you know how many members?

>> It's a hundred of companies.

Savannah Peterson

>> Whoa, really?

Savannah Peterson

>> It has grown very fast.

Savannah Peterson

>> That's awesome. Wow.

>> But the good news is that the fear was that, "Oh, there's going to be a bunch of cats in there and they're never going to resolve it." I think that they've done a good job of resolving down, and so we're expecting to see a ratified draft by Q1 or in Q1 of this year. That is great because now we can take the best and brightest ideas that we know are working, get everybody on the same page. A lot of stuff coming from Hemal's work honestly is going to make it in there. And so we know that it's going to be a quality spec that we can use that'll help lower the threshold in terms of bringing additional quality to Ethernet. And I think of it really as like, I know McLaren's a big Dell fan, they're very public about it. They have a fantastic race car today. Next year it'll be a whole new car. They're going to make it better, they're going to get additional telemetry, they're going to implement new things, and that's how I view the Ultra Ethernet. IP is firing on all cylinders right now, but this takes us to a whole new level.

>> And F1's in Vegas this weekend.

Savannah Peterson

>> Are you going?

>> I'm going to pop in and maybe I'll be there this weekend. Yeah, heading up there. Be there.

>> Have fun. That's great. Yeah, that's cool.

>> You mentioned herding, like you said cats, which maybe herding-

>> Yeah, I know. I love cats.

>> So because I think this is a huge topic, in fact, the Wall Street Journal had an article, Dave brought this up in our podcast two days ago, it said it isn't just about the data center. AI's plumbing needs an upgrade. It's funny, and they actually got an error in the story. I want to call it out, but as it comes up here, it mentions InfiniBand and it says InfiniBand is moving large amounts of data. They're talking about Nvidia. And then it says, Ethernet, comma, a competing platform considered less mature for network. And clarify this because it's not less mature. Ethernet is the most mature standard and it's not herding cats when everyone is open and wants to go there. So that Consortium is working because there's no need to herd. It's a stampede to the answer because that's where it's going because everyone wants open. Talk about the Ethernet maturity because the speeds are coming faster, you guys are doing that at Broadcom, we covered that before, but the importance of open ecosystem and why that makes Ethernet a better topology. Jump ball.

>> So of course Ethernet is our religion, as you know. So Ethernet is built on open standards and an open ecosystem. And one of the things, we talk about speeds and feeds, and 800 gig, 400 gig, but that's not the only thing. When you start looking at AI ML networking, the scale becomes very important. And that's where some of the shortcomings of InfiniBand then become visible. What we are seeing right now, a 10K GPU cluster is going to be a one million GPU cluster going forward. Ethernet with UEC and other kinds of features is going to have more resiliency in the network, fabric resiliency. How quickly you adapt to the link failures, that will decide your job completion time. Also, the transport enhancements that are coming with RDMA are going to enable long distance, because geographically you won't be able to fit one million GPUs in one area. So you will now be talking about sending this data over hundreds of kilometers. So that's where another set of enhancements that are coming on the Ethernet side is going to allow you to do that. And finally, the congestion control. Ethernet has always done well with open, telemetry-based end-to-end congestion control. We will continue to enhance that. These things all together are going to really make AI ML networking deployable at large scale with Ethernet.
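
To put the "hundreds of kilometers" remark above in numbers, the sketch below estimates fiber propagation delay and the amount of data in flight on a long link. It is only a rough model: the refractive index of 1.47, the 800 gig link speed, and the function names are assumptions for illustration, and switching, queuing, and serialization delays are ignored.

```python
# Rough propagation delay and bandwidth-delay product for a long-haul link.
# ASSUMPTIONS (not from the interview): light in fiber travels at c / 1.47,
# a typical value for single-mode fiber, and only propagation delay is
# counted (no switching, queuing, or serialization delay).

C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_INDEX = 1.47        # assumed refractive index of the fiber


def one_way_delay_us(distance_km: float) -> float:
    """One-way propagation delay over fiber, in microseconds."""
    return distance_km * FIBER_INDEX / C_KM_PER_S * 1e6


def bytes_in_flight(distance_km: float, link_gbps: float) -> float:
    """Bandwidth-delay product: bytes sent during one round trip."""
    rtt_s = 2 * one_way_delay_us(distance_km) / 1e6
    return link_gbps * 1e9 * rtt_s / 8


if __name__ == "__main__":
    for km in (1, 100, 500):
        delay = one_way_delay_us(km)
        in_flight_mb = bytes_in_flight(km, 800) / 1e6
        print(f"{km} km: {delay:.1f} us one way, {in_flight_mb:.1f} MB in flight at 800G")
```

At roughly 5 microseconds per kilometer, a 100 km span adds about half a millisecond each way, so an 800 gig link can have on the order of 100 MB unacknowledged at once. That is one reason loss recovery and congestion signaling across sites may need different tuning than inside a single data hall.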

>> And the geography, they just pointed out that distributed computing architectures are now the standard. You have to be connected. And this is a huge piece, it's a huge part of the puzzle. Sorry, I didn't mean to interrupt, but I wanted to amplify that point.

>> Absolutely. You're spot on. And let's be clear, there's no debate. Ethernet is the de facto standard for all things networking. Over the last 25, 30 years, they won. Okay, yeah, there are niche opportunities where things show up like InfiniBand and we can list a bunch of others, but if you stack up all the networks, IP is by far the dominant player.

>> Yeah. And the performance is there too. Again, we've crushed the InfiniBand debate on theCUBE here many times, we don't need to go there, but the point is that just the mainstream media, they just didn't get it right. That's the whole point. The networking is so important for the clusters that if you don't get it right, the whole GPU purchase fails. It fails to do the job it was hired for. Again, it seems like a small line item from a dollar standpoint, but it's a critical piece of the puzzle because it's inside the cluster and also outside, to your point, Hemal.

Savannah Peterson

>> Correct.

>> Yeah.

Savannah Peterson

>> Y'all have been collaborating together for such a long time and you've seen a lot of different eras of technology. Do you think that our current AI moment that we're having right now is accelerating the collaboration in these Consortiums to get to a place of collective betterment?

Savannah Peterson

>> Absolutely. Because yeah, as James can say, where we were talking about, "Let's do this in three years," now we are talking about, "Let's do this in the next six months or a year." And-

>> That is literally how it works.

Savannah Peterson

>> That's literally.

>> Literally.

Savannah Peterson

>> That's awesome.

Savannah Peterson

>> And we were talking, "How about these features?" Now we are talking about how about this set of features and so many things happening so fast, and that's what we have been working together and trying to accelerate.

>> Yeah, it is interesting. It felt like we all settled into, okay, this is what it takes to make an ASIC and to do a feature and to integrate and converge solutions. And then when the opportunity came with the whole AI explosion, we realized we can do it a whole lot faster. Every single company that we're working with is like, "You're doing it how fast?" And that's the way it's working. And these solutions, they're quality solutions.

>> My final question to you guys, again, this is another jump ball, but I'd love both of you to weigh in on it, fabrics was a networking thing, especially when you start to get into the hardcore infrastructure where everyone's playing right now, that's where the innovation is, as the fabrics evolve and become almost adaptive, in quotes. But these clusters are going to be built for a large scale and they're going to have a lot of multipurpose workloads that may look different at any given time. So adaptive fabrics, that's my word, it's not necessarily an industry word yet, but-

Savannah Peterson

>> It can be now....

>> you're starting to see these clusters have all the fabrics. It's not just the plug into the network. The fabrics define the glue in the work for the workload. So this becomes a huge part of the learning, the architecture. Talk about the importance of the role of the fabric generally, and then how people manage it and what's the future look like for the fabrics.

>> Have you been sitting in on my customer meetings?

>> No.

>> I've had the same discussion several times in the last two days and I'm going to let him jump in first.

>> As you mentioned, fabric management becomes a huge challenge for anybody who's deploying this at scale. So you ought to make it easy. Easy to deploy, easy to monitor, easy to manage. Also, with what you call adaptive, you need to now have fabrics smart enough that they have built-in resiliency. The transport reliability is also built in. And a lot of these, you don't make visible to the end customer because all they want to do is run their workload. They don't want to worry about, "Oh, this link went down," or so. So this is the kind of intelligence that, through software, we bring in and take into account to make fabric deployment easy and keep the fabric running all the time.

>> We don't have enough time, but I'd love to do a dedicated segment on resilience because resilience has been a cyber thing. That's ransomware, backup and recovery came from the storage side, but no one's really nailed that gen AI resilience story. What is resilience in AI up and down the stack? Because you got-

Savannah Peterson

>> And that's quite a question.

>> We're not going to do it.

Savannah Peterson

>> I was going to say, "So that's Tuesday's podcast."

>> We'll get back to it. I know we don't have time and we don't have to answer the question, but this is an open question because resilience is business value.

>> Absolutely. Absolutely.

Savannah Peterson

>> Yep, absolutely.

>> It's money.

>> A simple answer would be if I'm able to run my workload without having any checkpointing that I have to go back to, and all the time deterministically finish my workload, that will be the resiliency. Then I don't have to worry about what's happening inside the system and fabric. But of course we can talk.

>> Yeah, we'll definitely follow up.

Savannah Peterson

>> Great. I know, I was going to say, "I'd love to unpack it." That was a great single shot there though. I appreciate that. I have two more quick questions for you. Number one, we talked about herding cats and I have to ask about the dog on the back of your phone.

>> Oh.

Savannah Peterson

>> So cute.

>> My husky? Yes.

Savannah Peterson

>> Oh my gosh. Hold that up so the camera can look at how cute this dog is.

>> All right.

Savannah Peterson

>> This has been a really pleasant .

Savannah Peterson

>> He's so happy.

Savannah Peterson

>> We let folks say hi to their families sometimes on the show. I am a dog owner. I feel like it's important to say hi to our dogs at home watching. Obviously, now I sound insane, which is great. Last question for you both. We did this last year, same time. We're definitely going to do it next year, same time, and hopefully actually many times in between.

>> Absolutely.

Savannah Peterson

>> What do you hope to be able to say at Supercomputing 2025 that you can't yet say today? Jim, I'll start with you.

>> Well, I made a strong statement when you asked me that question last year, and I said, we will be talking about 800. It's in the booth, stop by. Next year this time we will be talking a lot more about 1.6 where everything doubles, and we'll certainly be talking about the buzz that gets generated as the UEC spec gets ratified and released, and we start seeing progress when we start seeing announcements from our friends making silicon as well as software on how to implement that and the advances.

Savannah Peterson

>> I love it.

>> How are you?

Savannah Peterson

>> You're good at claims. We'll definitely be following up with that sound bite. Hemal?

>> I'll continue on this trend. Next year we'll talk about, again, scale, scale and scale, but also we should be able to talk about some of the enhancements we are doing at the solution, which are already in the works, but it'll be more mature next year, so we should be able to talk about those.

Savannah Peterson

>> Well, we can't wait to host you again. Hemal and Jim, thank you so much for taking the time during such a busy week.

>> Thank you for letting us come and just have a discussion.

>> Awesome.

Savannah Peterson

>> Hey, we're always here to talk hardware, huskies, and anything else you've got.

>> Hardware and huskies.

Savannah Peterson

>> Hardware and huskies. I'm a UW Husky. Let's go dogs. I might throw that out there. Why not?

>> Why not?

Savannah Peterson

>> John, always a blast to hang out with you here on the desk, and I hope you're all having as much fun as we're having here and perhaps cuddling your dog or whatever you might be doing while you're tuned in for our three days of live coverage here at Supercomputing 2024 in Atlanta, Georgia. My name is Savannah Peterson. You're watching theCUBE, the leading source for enterprise tech news.