The rise of exascale computing presents challenges, with direct liquid cooling technology playing a key role in achieving high performance. Competition among Intel, AMD, and NVIDIA drives innovation for customers. Dell focuses on standardization in server designs to lower costs and improve serviceability, collaborating with OCP to develop a new form factor for enhanced performance. External power shelves ensure uninterrupted performance even if power supplies fail. Dell's hybrid air and liquid cooling approach addresses power efficiency and serviceability issues.
>> Good afternoon, high-performance computing fans, and welcome back to Atlanta, Georgia. We are coming to the conclusion of day two of our three days of coverage here on theCUBE. My name's Savannah Peterson, enjoying the ride with Dave Vellante this time around. We're learning some cool stuff up here.>> And we're going to talk HPC and we've been talking AI all week. We're really going to dig into high-performance computing now, which is what the show is all about.>> I know, and I've got stickers that say that. It's the theme of the show. Armando is one of the best people we could possibly have talking to us about it. Armando, welcome back.
Armando Acosta
>> Thank you so much. It's a pleasure to be here with you. We see each other once a year, and I'm looking forward to it every year.>> I know.
>> Talking with y'all.>> I get excited when I see you on the schedule, because I know you're always going to give us the lay of the land. You're a bit of the trend-eye forecaster.
>> Oh yeah, that's my job. If I don't do it right, I might not have a job.>> I think in this space, with your brain, you're probably pretty safe. Talk to us: since we saw you this time last year, what are some of the trends you've really started to notice picking up?
>> Well, as we talked about earlier, we actually do see the rise of exascale now. Exascale's been talked about for the last two to three years. But with the rise of exascale and these large machines, these large HPC supercomputers, guess what? New challenges arise when you try to go to that scale. So when you look at exascale, what it's driving is more direct liquid cooling technologies. If you want the highest performance, you want the best CPU, the highest-performing GPU, guess what? You have to do direct liquid cooling. But not only that, you also see the competition, which is good. What I say is, "Hey, now you've got products from Intel. Now you've got products from AMD. Now you've got products from NVIDIA."
And then when you think about this, what this does for customers is, "Hey, how do I address the silicon diversity and how do we enable our customers to have flexibility and choice no matter what technology they want to put in their server?">> That approach to agile innovation I think is absolutely so critical and something that you are paying attention to. How do you build and design these systems that are going to be able to adapt into the future? There's got to be a lot of standardization that you're advocating for.
>> Oh, yeah. You hit the nail on the head. At Dell, we're all about standards, because standards are good for everybody. Not only that, standards drive down costs for our customers, and that's what our customers want. But here's what's unique, and we'll be talking a lot about it. I don't want to steal Arun's thunder, because he's coming up next, but we've been working with OCP. We've actually got a rack in our booth. It's an OCP ORV3 rack, a 21-inch design. What we're now looking at is, if you want the ultimate performance, it's not going to fit in a 19-inch form factor anymore. So you want to go to 21 inches. But when you try to pack direct liquid cooling manifolds, quick disconnects, and power distribution units into that 21-inch rack, it starts to get really packed, and serviceability goes by the wayside. So with our new design, if you go check it out, we took serviceability to heart, because that's what customers told us they wanted. What we've done is we now have an external power shelf. We've taken the power supplies out of the compute tray, and by doing that with these power shelves, we're able to give you full performance without throttling. Think about this: if you have four power supplies in a server and one of those power supplies goes down, guess what? You've got three. And guess what? Your performance suffers and degrades. Well, now with these external power shelves, I've got six power supplies in the shelf; I can lose two and still have four active, and you can still run at full performance. Not only that, I can put in multiple power shelves, and you can lose a whole power shelf and still be up and running. You don't have to throttle, you don't have to worry about your job not running. And guess what?
You come back 15 hours later, you're good to go, and you don't have to start from scratch.>> And you guys are doing some other innovative things, and I don't know if it's new, but it was somewhat new to me: instead of walking behind the rack and feeling a hot aisle, the hot aisle is actually inside the rack. You do some interesting things like that. And obviously the hybrid air and liquid cooling. Do you see that, Armando? Do you see that hybrid as the norm going forward?
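The power-shelf redundancy arithmetic from the answer above can be sketched as a toy calculation. The supply counts and the assumption that four active supplies cover the full power budget are taken from the conversation for illustration; they are not a Dell specification.

```python
# Toy model of the power-shelf redundancy described above.
# Assumption (illustrative, not a Dell spec): the server needs 4 active
# supplies to run at full power without throttling.

def can_run_full_power(total_supplies: int, required: int, failed: int) -> bool:
    """True if the surviving supplies still meet the full power budget."""
    return (total_supplies - failed) >= required

# In-chassis: 4 supplies, all 4 required -> one failure forces throttling.
assert not can_run_full_power(total_supplies=4, required=4, failed=1)

# External shelf: 6 supplies, 4 required -> lose 2, still full performance.
assert can_run_full_power(total_supplies=6, required=4, failed=2)
```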
>> Yeah. Well, everybody takes a different path and a different route, but for us, we do believe in hybrid. And here's the reason why. If you direct liquid cool the CPU and the GPU, you've already solved 85% of the power problem. And if you use highly efficient fans like we do, along with our specialized fan algorithms, you can actually cool that other 15% with fans. Now, here's the thing that customers have told us. If I try to put direct liquid cooling on voltage regulators, if I try to put direct liquid cooling on memory DIMMs, if I try to put it on CPU memory, all of that, well, guess what? You have a lot of copper throughout the chassis, and if a DIMM fails, it takes you an hour to get to that DIMM slot in order to replace it. So what we do is say, hey, just do the CPU and GPU, and then we can cool the rest with airflow. And oh, by the way, that gives you better serviceability for the components that typically fail the most.>> It reminds me of that $20 part in your car that costs $800 to fix-
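The 85/15 split described above can be put into a one-line back-of-envelope calculation. The 10 kW total is a made-up example for illustration, not a Dell figure.

```python
# Back-of-envelope for the hybrid approach: direct liquid cooling on the
# CPU and GPU removes ~85% of the heat load; fans handle the rest.
# The 10 kW total is an illustrative assumption, not a Dell number.

def split_heat_load(total_watts: float, liquid_fraction: float = 0.85):
    """Return (watts removed by liquid, watts left for air cooling)."""
    liquid = total_watts * liquid_fraction
    return liquid, total_watts - liquid

liquid_w, air_w = split_heat_load(10_000)   # a hypothetical 10 kW rack section
print(liquid_w, air_w)                      # 8500.0 1500.0
```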
>> To get to.... >> just to get to it.>> Literally. I know, exactly, that's why I buy old cars. They didn't design them that way. They designed them to be fixed.
>> When you try to replace a battery, you have to take the wheel off. You're like, hey, the battery should go on top.>> You mentioned silicon diversity before. I mean, you guys have always had multiple suppliers, but it seems like the suppliers are throwing more at you every day. How are you managing that for customers, both internally at Dell and externally for customers?
>> So in our 17th-generation server line, we've just announced it, but we actually have driven a new spec called DCMHS. So it's also HPC, excuse me, OCP. I said that wrong. But with DCMHS, it's a modular hardware system. And what we've done is say, "Hey, if you all build to the same spec, we are able to essentially validate and test, and we're actually able to build that in a form factor quickly for you."
So our partners come to us: "Hey, this is one form factor; here's another form factor." Well, guess what? I don't have time to build three different chassis for these different technologies. If you all build to the same spec, guess what? I get one chassis and I can drop in whatever you want. And that's better for our customers because I can give them flexibility. As you know, you've been in the game a long time: not all workloads are created equal, and not all technologies are created equal. And so our customers do want to use Intel, they want to use AMD, they want to use an Arm CPU with an NVIDIA GPU. So we want to enable all of that. And not only that, by driving the standard, guess what? We're able to be agile, we're able to incorporate new technology faster, and we're able to deliver faster time to market as well.>> It's driving the standards, which is really important, and we're in such a green, early area with that, where that blueprint is still being drawn. I mean blueprint, I guess in this case, very much being a pun: Dell being one of the people really driving that pen and architecting that blueprint. But you're also talking about modular design, which I think is one of the things that's very uniquely core, perhaps pun intended, to Dell's design this year. And you have Project Luna. You're doing this at the laptop level and you're doing this at the biggest, most powerful rack level that's humanly possible right now, so that you can plug and play. How much shorter is that hardware innovation lifecycle than, say, 10 years ago?
>> Well, 10 years ago, you would typically produce a server every 18 to 24 months. Fast-forward to where we're at today: one of our trusted partners, NVIDIA, is producing a new product every nine to 12 months. So for us, you've got to drive those standards in order to be agile. The other big thing we're doing is around direct liquid cooling. If you know anything about direct liquid cooling right now, it's the Wild West. We're driving standards into the manifolds, the quick disconnects, the O-rings that go into that. We're now essentially saying, "Hey, here's our spec; can you meet our spec?">> It's amazing how much the thinking and the discussion around those standards has advanced. I mean, it is the Wild West. It was kind of amateur hour, and now you've got these multi-hundred-billion-dollar investments-
>> That's finished.
>> And if they have leakage, that's a problem. And so-
>> That's a multi-million-dollar coffee spill right there.>> Exactly. And so I'm really encouraged to see all the attention you guys have put on that. And I got a tour of Tim Shedd and Ehab's lab.>> Oh, to Dallas, you got a tour of that.>> It was very cool. We spent quite a bit of time there and then we went inside. It was really loud, but it was great.
>> You actually got to talk to the brain trust, Ehab and Tim Shedd.>> Good time.>> We love Ehab. Ehab's a regular.
>> One of the things that they shared with us, and I think this came out at OCP maybe this year or last year, is they want to drive warm water through the liquid cooling. How far do you think that innovation can take us?
>> Well, when you look at it, there are different specs per country, and everybody has different environments, different types of chillers. So what we try to do is go for a range, and everything we do, we test within that range. If you want to use 30-degree water, if you want to use 35-degree water, if you want to use 42-degree water, we go and test and validate all of that for you. When you look at that innovation lab that you mentioned, guess what? We are testing the CDUs. Guess what? We're testing the manifolds. We're testing the quick disconnects. We're testing everything within that rack. And we're doing full-solution testing, so that when it gets to your four walls, you're not going to have those problems. We've already tested it in our factory.
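The range-testing idea above, validating that a facility's water temperature falls inside a supported window, can be sketched as a simple check. The window bounds are assumptions for illustration, and the 30/35/42 figures from the conversation are read here as degrees Celsius; neither is a published Dell spec.

```python
# Sketch of inlet-temperature range validation. SUPPORTED_INLET_C is a
# hypothetical validated facility-water window, not a published spec.

SUPPORTED_INLET_C = (17.0, 45.0)

def inlet_temp_ok(temp_c: float, window=SUPPORTED_INLET_C) -> bool:
    """True if the facility water temperature is inside the tested range."""
    low, high = window
    return low <= temp_c <= high

for t in (30.0, 35.0, 42.0):    # the temperatures mentioned in the interview
    assert inlet_temp_ok(t)
assert not inlet_temp_ok(60.0)  # outside the assumed window
```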
>> That is not surprising, and also pretty awesome to think about. Now I'm super FOMO; I have to go take that tour. I'm crashing the next->> It was an awesome tour.
>> I'm crashing the next trip to Texas. How do you... Because I can imagine there's a lot of partners and players, frankly, who want to see their solutions inside your greater solutions or inside the AI factory. What's the evaluation process as you're thinking about which components to pull into this ecosystem?
>> Customers are always going to be our guiding light. That'll continue to be what we listen to, and we're always going to listen to what our customers want. And essentially we're going to deliver the best technologies based on those needs. When you look at our partnerships, we partner with the best. And the reason we partner with the best is we know we cannot build an end-to-end solution on our own; that's why partnerships are key. So we're always going to partner with best of breed, and essentially it's going to boil down to what customers want, when they want it, and how they want it. And we're going to do it.>> Well, to that end, let's talk about networking. You've got InfiniBand, you've got Ethernet and the Ultra Ethernet Consortium, you've got NVLink. You guys are like T. Boone Pickens: all of the above.
>> Going back to the energy sector, Oklahoma State-
>> We're going all over.>> We've really-
>> So on networking: there was an article in the Journal today about how data centers need new plumbing, and I thought they were going to talk about liquid cooling. They were talking about networking. And it was interesting, the article invoked Cisco, and I'm like, this isn't about the switches and routers of the old days. This is about new designs, both within the XPUs, or XPU to XPU, and then across clusters. So tell us your thoughts on the state of the art in networking for AI. How do you think about it?
>> Well, the beauty of this is we're going to support all three. If you look at Michael's tweet, he tweeted it out this weekend: we were actually the first to ship the GB200 with NVLink. So we shipped that to a >> Congratulations, by the way. Very exciting.
>> Thank you very much.>> We saw a lot of other tweets on Monday, too.
>> And the other big thing is, we still believe in InfiniBand and we also believe in Ethernet. If you look at what Broadcom's doing with Tomahawk 5, you're getting up to 400 gig. But the beauty of this is that customers told us they want all three, and we're going to support all three. So if you look at the rack that we have in our booth, Arun's going to come up next and talk a little bit more about it, so I won't steal his thunder, but within that, we'll support GB200 NVL72. We can support that. We also have another product coming out, an NVL4 system. With that NVL4 system you can actually scale out with Ethernet or essentially InfiniBand. So we offer all three options. Because here's the thing you've got to consider. When you look at HPC, we knew the fabric was always important. That's why we did InfiniBand. But now, with that combination of HPC and AI, the fabric becomes even more important. And the other thing we're looking at is not just the east-west, GPU-to-GPU traffic; now we're actually looking at the north-south traffic as well. So your storage connection: hey, if you want to train a model really fast, guess what? You'd better have a quick pipe and a quick connection to your storage, because you've got to bring that data up and actually train the model. And guess what? If you train the model faster, you get to insight faster. And once you get to the insight faster, that's what's going to make a difference.>> Isn't that funny how we've come full circle?>> It's Happy Valley, baby.
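The "quick pipe to your storage" point is easy to quantify: the time to stream a training dataset scales inversely with link bandwidth. The dataset size and link speeds below are illustrative assumptions, not figures from the conversation.

```python
# Why the north-south (storage) pipe matters for training: seconds to
# move a dataset over a network link. Sizes and speeds are illustrative.

def stream_time_seconds(dataset_gb: float, link_gbps: float) -> float:
    """Seconds to move dataset_gb gigabytes over a link_gbps (gigabit/s) link."""
    return dataset_gb * 8 / link_gbps

# A hypothetical 100 TB dataset over 100 Gb vs 400 Gb Ethernet:
print(stream_time_seconds(100_000, 100))  # 8000.0 seconds (~2.2 hours)
print(stream_time_seconds(100_000, 400))  # 2000.0 seconds (~33 minutes)
```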
>> Exactly.>> We've come full circle on that traffic.
>> But no, we're going to support all three, and we want to support all three, and we believe customers will need all three.>> I can tell you're part of a community-focused, customer-focused team. I know that Dell is as well, but you've definitely brought it up quite a few times. Are there some customer use cases, perhaps realized or in the works right now, that you're able to share with us that get you excited as a technologist and as a human?
>> To me, I mean, we've talked to customers this week who are using AI for cancer research, and that's dear to my heart. You see virtual assistants now where it's not just answering a simple question; they actually help you take proactive measures. So that's the next thing we're looking for, those virtual assistants. And then you know what I see? Recommendation engines are still going to be key. And of course, language processing with large language models. I mean, now you see large language models with 10 billion parameters, 20 billion parameters, 80 billion parameters. And when you think of those parameters and all the learnings you get from that, it's for the greater good and it's going to raise all boats. That's what I'm excited about.>> Well, it certainly gets me excited. Even just sitting next to you, your excitement is contagious in the best way. Can you give us any preview of what we can expect next out of your team in the Dell squad?
>> I can't comment on any of that, but what I would like to tell you is please stay tuned->> Worth a shot.>> Absolutely.
>> Our number one goal is to make a difference for our customers, and to make life easier for our customers. That's what we're going to do. We're going to put the easy button on AI. We're going to put the easy button on the Dell AI Factory, and then Arun's coming to talk to you about what we're doing in our factory integration as well. So, end-to-end solutions. And the other thing you talked about earlier with AI: it's a journey. When you think about it, when you start small, you typically do it on a desktop or a workstation. Hey, your model gets a little bit bigger, then, "Hey, I need a server. Hey, my model gets bigger and bigger, I need a cluster. And hey, now this is going large scale, so I don't need a cluster that's eight nodes. I need a cluster that's 164 nodes. I need a cluster that's a thousand nodes."
So, the beauty of this is that we want to meet our customers wherever they are in the journey. If you're starting small, come talk to us. If you're the biggest of the big at rack scale, let us show you what we just did with GB200.
>> It's one of the things I really love about Dell, and it goes back to the community piece: you're there with customers on the whole journey. You're not just a vendor but a partner throughout that iteration journey, and everyone can affect your product roadmap.
>> Just a little plug: we held an HPC community event Monday. And the biggest thing about the HPC community is we have our customers come and talk about what they're doing, customers talking about their use cases and what they're delivering, because that's what's most important: if we make our customers successful, we'll be successful.
>> It's that culture of collaboration that always keeps Dell at the top. And it's inspiring to see. We see it across the show floor here, between government, hyperscalers, big companies, the little guys and the startups. It's all happening here, folks. Last question for you, Armando, since this is now our ritual and tradition every year at Supercomputing. I realize you can't spill the beans on future announcements, but I can't help but ask: what do you hope to be able to say when we're hanging out in St. Louis for Supercomputing 2025 that you can't yet say today?
>> What I want to be able to say is that we've enabled our customers to get faster time to value with their solutions. The necessary evil is you got to build a cluster, you got to deploy the nodes, you got to deploy the fabric, you got to deploy the storage. But the biggest thing that we want to do is we want to roll in a full solution into your four walls where all you have to do is connect the water, assign IP addresses, and let's roll. Let's get going. And that's what we're all about. We're going to put the easy button on AI and HPC.>> Love it.
>> Love it. I was just going to say, it's that easy button, baby. We need a blue easy button. I can see in my mind's eye, we're going to have to figure that out. Armando, thanks so much for taking the time.
>> Oh, thank you. I appreciate it so much.>> Nice having you here.
>> It's such a great conversation with you. I look forward to this every year.
>> Oh, good. Well, that makes three of us, and thank you, Dave, for quite an afternoon we've had.>> Oh, it's been great.>> I'm loving it. I hope all of you are having as good of a day as we are here, learning from the world's smartest people at Supercomputing 2024 in Atlanta, Georgia. My name's Savannah Peterson. You're watching theCUBE, the leading source for enterprise tech news.