SVP, Infrastructure Solutions Group, Compute and Networking Product Management, Dell Technologies
At SuperComputing 2024 (SC24), John Furrier and Dave Vellante host theCUBE's live coverage. Dell's Ihab and Arun discuss progress made in developing scalable systems for HPC and AI based on OCP specs and DCMHS for compute. Dell's portfolio includes rack-scalable systems for liquid-cooled setups and air-cooled systems. An emphasis on open systems allows industry-wide innovation in hardware and software, driving faster product development cycles. Dell focuses on building a robust software ecosystem to complement its hardware offerings in the AI space. Networking, c…
What process did you undertake to develop a fully open system for HPC and AI that is based on OCP specs, ORv3 21 inch, and DCMHS for the compute, while also focusing on density and performance?
What are the key considerations when designing a rack-scalable system for different customer use cases?
What is the impact of open source (both hardware and software) on innovation, particularly in the context of AI?
What is the next set of software coming from entrepreneurs and enterprises in response to the changing landscape of AI technology and distributed computing?
>> Welcome back everyone to theCUBE's live coverage here at SuperComputing 2024, SC24. I'm John Furrier, with Dave Vellante, my co-host. Of course, we run the Pod, theCUBE Podcast, every Friday. Check it out. All the long-form commentary in the industry. Two great CUBE alumni, and a lot of the brain trust behind Dell's AI success here at Supercomputing, Ihab and Arun. Ihab is obviously the SVP CTO of AI compute and networking at Dell. And Arun is SVP portfolio manager for computing. Guys, first of all, great to see you again and congratulations on a great show. Dell really brought out the A-game today and this week. Congratulations.>> Thank you.>> Thank you very much. Exciting.>> I know we've had a lot of conversations in the past three years here at SC and other places. The work's been in progress, so it's a lot of hard work to get here, so it's a good day to celebrate when you bring out the goods, and you've got the factory and you've got the real high performance. Take us through, guys, a couple of years back. What was the catalyst? Take us through how we got here, real quick, as context, because I think it's super important to know that this just doesn't happen overnight.>> Yeah, I mean, like we said before, it's exciting to see it and feel it. This is the kind of show where people want to see something, and we made tremendous progress. Three years ago, we started to work on scalable systems, which is what you have on the floor now, and we wanted to be a much bigger player in HPC as well as AI. We took a different path than most people. We decided to build a fully open system based on OCP specs, ORv3 21 inch, and also DCMHS for the compute, and fully open on software. And we also wanted this to be the most dense, the best in performance, and a fully open system.
Dave Vellante
>> So that 21 inch, if I could just follow up on that. We saw that in your lab, my understanding is some manufacturers, some silicon manufacturers want to go wider, some want to go deeper, some want to go both. And so you said this is an OCP, I don't want to say mandate, but that's their sort of recommended spec, right? So, you can accommodate virtually anything. Is that correct?
>> Yeah. OCP specs, we're fully compliant with OCP specs, which means anybody can build to it and we're able to accommodate the ecosystem. However, we did modifications, as you said, what you saw in the lab, to make it much more operationally easy. So, all the cabling is easy to do, it supports all the power. It supports the manifolds for liquid cooling in a very simple way with quick disconnects.
>> The portfolio's not so simple anymore.
>> No. So, I think the way we've thought about our portfolio is that we want to build rack-scalable systems, and the way I think about rack-scalable systems is threefold, right? It's designed by Dell, engineered by Dell, manufactured by Dell, and supported by Dell. That's how you need to think about what a rack-scalable system is. And the way we've designed this is to support all customer use cases: liquid-cooled in 21-inch ORv3 standards, liquid-cooled in 19-inch. So customers will have brownfield data centers, we'll have 19-inch data centers, and we'll have liquid-cooled there. And then finally air-cooled, because there's a portion of the enterprise that's still going to be air-cooled. And we built rack-scalable systems for all three scenarios and all three customer types.
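The three deployment scenarios described here can be sketched as a simple selection rule. This is illustrative only; the function name and decision logic are hypothetical, not Dell's actual configuration process:

```python
# Illustrative sketch of the three rack-scalable scenarios described above:
# liquid-cooled 21-inch ORv3, liquid-cooled 19-inch for brownfield data
# centers, and air-cooled for the rest of the enterprise.
# The selection logic and names are invented for illustration only.

def pick_rack_config(has_liquid_cooling: bool, brownfield_19_inch: bool) -> str:
    """Map a customer's data-center constraints to one of the three scenarios."""
    if has_liquid_cooling and not brownfield_19_inch:
        return "liquid-cooled 21-inch ORv3"
    if has_liquid_cooling and brownfield_19_inch:
        return "liquid-cooled 19-inch"
    return "air-cooled 19-inch"

print(pick_rack_config(True, False))  # -> liquid-cooled 21-inch ORv3
print(pick_rack_config(True, True))   # -> liquid-cooled 19-inch
print(pick_rack_config(False, True))  # -> air-cooled 19-inch
```

The point of the sketch is that all three branches are first-class targets, matching the "all three scenarios and all three customer types" framing above.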
>> Well, and I think what's interesting about the last thing you said is, I get excited about liquid cooling, and again, I get the lab tours, you get pumped up about AI, but->> I like air cooling, and I like cooling on chip. Actually, cooling on chip is pretty cool.
>> Yes it is, especially direct-to-chip. But if I'm a CIO, I'm like, I'm here today at point A, and I want to get to point B, and I don't have a billion dollars to spend on getting there and hiring a bunch of consultants. So how can I get leverage out of my existing infrastructure, which is air-cooled?
>> Yep.
>> So, you have an answer for that?>> We do. I'll tell you this too. I mean, if you think of an existing data center, somebody has a brownfield data center. If you move from a 14th-generation server to our 17th-generation server, you get a seven-to-one consolidation. So you can create space in your existing data center to put AI workloads in it, an air-cooled phenomenon. So we are building infrastructure that allows all customer types to scale. That's what we are trying to do.>> So, on this rack system that you guys have, integrated rack-scalable solutions, that's kind of the category. Dell engineered, Dell made, saw all that innovation. What jumped out at me was the OCP compliance, because obviously, Dave, remember we did the first ever OCP-
>> Yeah, I spoke at it.>> was there, and it was the beginning of open source infrastructure. And I can't tell you how many times we've had conversations here on theCUBE with you guys and others around the debate between open and closed. Open Ethernet, for instance, is as hot as can be. Because why? This ecosystem, open source. And the other thing I want to get your reaction to, guys, if you don't mind, is a term we've been hearing on theCUBE. I feel like I'm back in the '90s, which at that time was an inflection point for infrastructure. Open systems kicked in and that changed everything. We're at that new wave here. It's like the old era is over. How important is open? Because you've mentioned it a few times; we're hearing open source, you see all the open source models, the Open Compute Project, which the rack is now fully compliant with. Why is that such a big deal? I want to get your opinion as leaders at Dell.>> Yeah, I will start. As you said, correctly, open source, both hardware and software, has been a catalyst for innovation. I would say this cycle now with AI, innovation is happening at 5x the speed that we've ever seen before, both in software, new models, new optimized models, smaller models, as well as hardware, new accelerators-
>> That's not hyperbole. That's like real.>> That's what's happening.
>> That's the data, right?
>> And there's no way to do something like that easily if you don't allow full industry innovation. So an open system means people can innovate on CDUs, they can innovate on the network, they can innovate on Ethernet, GPU, CPU, and have a blueprint to write to. So, if you are one of the silicon companies, you know you're going to design to OCP specs to speed up adoption, or liquid cooling companies know what quick disconnects look like, for example. It's such a big deal now given the massive explosion of innovation.>> And just for the record, a CDU is a coolant distribution unit, right?
>> Yes. Which is part of the liquid cooling system.
>> So okay, I get that. But Dell is a very pragmatic company, and Michael drives frugality throughout the company. So when you have to support all those different options, that puts a lot of pressure on you guys. So, was there ever a point of a conversation inside the company, "Well, maybe if we go more homogeneous, it'll be easier for us. We'll save, be more efficient that way versus," I mean it really puts a lot of pressure on you. How did you think about that?>> I'll tell you this, right? You said Michael, right? It goes back to our history and heritage. We have been an open-standards company from the very beginning. So the challenge to us was, you need to hit open standards and you need to hit time to market. So if you have to do both, you don't get a choice of one or the other.>> Hard.
>> Yeah.>> So we worked really hard to pick open standards first, no compromise on the base principle. And then we started figuring out how to innovate really fast. We've changed almost every process in our company to hit the inflection points of the market. So our product development cycles used to take 18 months to two years to develop a server. We are in six-month cycles. Sometimes we're developing products in four months. We have changed every ecosystem in the company to make this happen. But we will not compromise in the principles of open standards and time to market. That's our core tenet.
>> So, how did you do that? Was that new skilling? Was that new tooling? Was that new training?
>> It was a combination of things, right? First, we needed buy-in from everybody in the company to do that. That we got, from Michael's blessing and everybody else's direction; then we had to change our process. Ihab built out a team called the CTO pod team. Those pod teams are looking at and talking to customers and giving us requirements in real time, and we worked with engineering to build new capabilities to build products really fast. So, it was process, technology, and all of the manufacturing in the backend changing in order to make this happen. So, a complete retooling of our company is a bit->> You actually took the process and reconstructed it for today's modern era, with insight and action tied together as quickly as possible->> To go 5x faster.
>> And I'd also add, from a technology perspective, we've designed the system to be modular with building blocks, instead of building every single server as a standalone platform. Now you have building blocks and form factors, and you can respond to the market very quickly by picking these modular building blocks and putting them together. That's what sped up the product development engineering process by a significant factor. And this way, we're also able to respond to some of the largest AI customers with some customization that's needed, and respond to the market.
>> I mean, right. So that's because enterprises still want customization. So this is again, this interesting tension. You want things to be standard, but then enterprises say, "Well, if you do this, you'll get the order."
>> Yeah, we're disciplined about the customization we do. So we usually go after customization if we think it's going to benefit the majority or a large segment of customers. And then, as Arun was saying, we built it into the product development process to adjust our roadmap dynamically.
>> So, if it's a Snowflake->> We probably will.>> Yeah, we wouldn't do every Snowflake out there. So that's how we are choosing and being deliberate about it.>> And that's based upon intentional known use cases you think that are going to be bigger?
>> Customer use cases which we believe will scale into large volume and revenue. So that's something->> Yeah, the change, I'll go back to what Arun said: think of it as faster response to customer input. Because the speed of innovation also means customer input is coming faster. So working up front with the customers, designing these systems for them, we design the network, the storage, cooling, data center design, system design, we pull in all that input, and then quickly we recognize the patterns the customers are asking for, feed it into product and engineering to update and be nimble.
>> So, two major vectors broadly, and within them, I'm sure there are a lot of patterns, but the first being the big LLM vendors, the big five. We know who those are; I think you sell to many of them. I'm not sure if you've named names, but whatever. We know you're selling to some of them. And then the enterprises, the Jamie Dimons of the world, with 150 petabytes that are going to be unique.>> They're going to need an AI factory.
>> Okay, so now you have to have a portfolio that... So you only have five on this side, and then maybe there's five others that are sort of small versions of that. And then you've got tens of thousands, hundreds of thousands on the other side. So you've got to accommodate both. I wonder if you could speak to the patterns that you see there and how you're engineering to them.>> Yeah, great question. So when we see the product development cycle and the market itself, think of the enterprise customers. I mean, the JP Morgans. They are still going to do inferencing. Some of them are small-scale models, but inferencing is the main use case. And what we see is they are going to be on PCIe technology. So if you look at our floor, we built two new PCIe platforms, the 7740 and the 7745, for precisely that use case. You can put eight double-width GPUs inside. You can start small and you can scale up. So for those people, we built those platforms. On the other extreme, for the big large LLM models, we have the GB200 NVL72 in a 21-inch rack design. Our goal here has been to enable the AI portfolio to scale to all types of customers. And second point, I think you made a really important point. How do we do that? It's because we built our servers on DCMHS standards. So, I can take some of these building blocks from the small server and use them in the largest server. It increases that -
>> For usability, it leverages->> The usability is leveraged.
>> Oh man, great operating leverage.
>> And that's how we've been able to deliver to the market.
>> So what else are you speeding up at Dell? How are the product launches, and have you matched the entire end-to-end product lifecycle? I mean, it sounds like a sea change culturally for Dell.
>> It's a complete cultural change, but also when you look at our roadmap, we've tried to cover every aspect of where we think the AI market is going to go, right? I mean, all the way from the smallest scale to the medium scale to the large scale. And our product, the 9680, as a single product, is the most successful product in the history of Dell. I'm hoping that the next five products we've built, right now on the show floor, are just as successful or even more successful.>> We've talked about this in the past, and Arun, we touched on it a little bit earlier in our segment earlier in the day: every wave, things change so fast that it's almost the next version that comes out. But you look at the PC revolution, on the processor side and memory, some of it's components, but mainly the processor, it was always backwards compatible. So it got better, but it still worked together. So having modular designs and open standards ensures that interoperability and backwards compatibility. But with AI, it was a little bit different. You had transformer technology come out that kind of made all those algorithms that I used to do in the '80s relevant. And now a new architecture emerged, and then new software came that we've never seen before. Now, with the new architecture with factories, what software do you guys see coming to take advantage of this goodness? And by the way, it's not just on-prem, because all the top customers want to connect to the cloud too, so it's distributed computing. So it's not so much on-prem versus cloud battling anymore. It's more, "Hey, I just want the big iron on-premises because that's where my action is, my IP is. So, I might connect to the cloud for some customization, or just do it all here."
So what's that software, the next software coming, that you can squint through this current setup and see? Because ChatGPT set the table; we see some RAG out there, we see some nice automation and process improvement. What's the next set of software coming from the entrepreneurs, from the enterprises? The enterprise only has like 1% penetration on AI.
>> Very good question, and I agree with you completely. We talked so much about hardware, but the innovation in software and the ecosystem and what we're doing there is probably just as big of a story, if not bigger. A couple of things there. We have been consistently working on the new software ecosystem. AI is a new ecosystem of software companies. It is incremental to the software ecosystem we had before. So, we've talked before about Hugging Face; almost all the containers come from there as models. We're the only company on-prem partnering with Hugging Face right now. We built an enterprise hub. We're very close to Meta: we presented at Connect, we implemented Llama Stack, we have all the capabilities. Again, we're unique in our partnership with Meta. We are going to announce two, three more software companies, and we see five, six software companies all coming out. You should think about it this way. Many of the big customers who buy our platforms are also software companies. So, when they're training the software on our platforms, that means we can easily take that and go to market with them, with all our customers. So, we will continue to build that ecosystem, which is critical.>> Arun... Oh, go ahead.
>> No, go, please.>> Arun, because I talked to you guys in the past about the future of AI, this is like three years ago. I said the developer in the old days, everything was localhost on your Mac or PC, and then you just upload to the cloud, or you buy a box and provision it and then hope it works. And now the servers, the clusters of that, are the developer platform. So, developers would just buy a PC and that's fine. The AI Factory seems like it's the same thing. So, these developers, they need big iron, they need inference. How are you looking at that developer market? Because if you're going to have that software ecosystem, those developers have to have easy codeability as if it's their machine. This is going to be, I think, an odd area for the innovation. What are your thoughts on developers? What are you guys doing? How are you making the provisioning of that infrastructure, whether it's a rack in a shared factory or a dedicated factory? I mean, that's the question: I'm going to need more horsepower to run my models. Is it going to be the new PowerEdge server with the four H200s? That's a pretty big server.
>> Yeah, I mean, it could be any of those things, right? The way we are thinking about it is the developers will have our platforms available to them, and the key point is the AI factory is designed to be a solution. It has the software layer built in so developers can develop on it, and we want to engineer it out of the factory so that it's a ready solution they can use. Think of the old days: you'd get a PC all set up so you could develop on it. We are developing an AI Factory that comes to your data center all ready to develop on.>> It's like a developer cloud.>> It's a developer cloud, right? An on-prem developer cloud.>> Yeah. A developer resource.>> Yeah.>> A better way->> A factory. Inside the AI factory. Can we talk about networking for a moment? So at Mobile World Congress, MWC, I never really thought about Dell as a networking company, but now that Dell's an AI company, networking has become this new bottleneck. There was an article in the Wall Street Journal today talking about re-plumbing the data center. I thought it was going to be about liquid cooling, but no, it was about networking. But there's a new type of networking architecture emerging that you guys are, all of a sudden, it's become a tailwind for you. I wonder if you could explain that a little bit.>> Yeah, we've talked about networking before, us here. Very exciting topic for me. I think networking is the most important element in the performance of an AI system. So we've been extremely involved in networking. Today, we have big partnerships with the big silicon providers and we're expanding them more and more, in addition to all the great work and all the collaboration we continue to do with Broadcom. We're also working with NVIDIA, who is a great networking company and has very good solutions for AI. So, with both of them, we are designing switches and optics, and now we've started to deliver to some of the big customers everything from structured cabling to fully designed systems that we install and test.
For the most part, you said it correctly: the network has become part of the system, and what's happening is it's not independent anymore. For most of the players, we're doing both. And the same OCP open specs that you saw here, we're going to take all of that to networking and extend this innovation into the networking space with the same innovation we did for compute.>> But if the networking doesn't work in the AI Factory, it may not be the biggest cost item compared to, say, the GPUs, but it's a critical failure point.>> It's a critical performance implication.>> Explain this, because this is one of the biggest things. All the networking guys and gals know this, but we've got to keep in mind that GPUs are designed for job completion, not jitter and slowdown. I mean, explain why this is such an important feature not to be overlooked.>> I think there are two things about networking not to be overlooked. One, if you don't tune the network at the software level with the communication libraries, let's say you are running 80% utilization GPU to GPU, that means you're leaving 20% of your GPU capacity idle. If you paid a hundred million for a system, that's like a $20 million price tag. We've tuned our network to 97% throughput on those links, almost at line speed. The second one: with these big models, and even for inferencing, if you lose one GPU because of the network, the whole training job stops. So, you lose the entire cycle; it doesn't matter how big the cluster is. And then you have to go checkpoint, save the model, and restart it. So, one network failure on one optic will stop the whole job. That makes it very critical.>> And that restart, by the way, means burning more GPU time to redo everything.>> Exactly.>> And think of it: you have a hundred-million-dollar system, and the optic costs a thousand dollars, right? The proportion of impact isn't->> It's significant.
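The two costs described here reduce to simple arithmetic. A minimal sketch, using the illustrative numbers from the conversation (the $100M system and 72-GPU cluster are the examples mentioned; the function names and checkpoint interval are assumptions):

```python
# Illustrative sketch of the two networking costs described above:
# (1) idle GPU capacity when link utilization is below line rate, and
# (2) lost work when a single failed optic halts the whole job and it
# must restart from the last checkpoint. Numbers are for illustration.

def idle_capacity_cost(system_cost: float, utilization: float) -> float:
    """Dollar value of GPU capacity left idle at a given link utilization."""
    return system_cost * (1.0 - utilization)

def lost_work_gpu_hours(cluster_gpus: int, hours_since_checkpoint: float) -> float:
    """GPU-hours thrown away when one failure halts every GPU in the job."""
    return cluster_gpus * hours_since_checkpoint

print(idle_capacity_cost(100e6, 0.80))  # 80% utilization: roughly $20M idle
print(idle_capacity_cost(100e6, 0.97))  # 97% utilization: roughly $3M idle
print(lost_work_gpu_hours(72, 2.0))     # 72 GPUs, 2h since last checkpoint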
>> It's like the connection integrity in a hose, for crying out loud.>> Exactly. Excellent example.
>> You've got to knock those little problems down because they can make big problems.>> So you want the people who are building those systems to be highly reliable.>> All right guys, you guys are really rocking the show. Final question: what are you most proud of, from the performance to getting to this point? Obviously you're getting great reviews here at the show. What is the thing you guys are most proud of that people may not know, or may know? Or share a cool thing.>> I'll tell you what I'm most proud of, and then Ihab can tell you what he's most proud of. I don't know if you saw it, but on Sunday evening, Michael tweeted the GB200 NVL72, the first in the world. I mean, that's an incredibly proud moment for us, right? I mean, we have worked on that ahead of everybody else, ahead of any other company out there, doing it first in the world; that made me really proud as a product manager. I don't know what Ihab thinks, but->> I said it looked like that thing from Star Trek, that big machine that just ran everything.
>> Yeah.>> Exactly.
>> I agree with Arun. That moment may look like a simple statement, but what it took... it's not just that we had the first GB200 NVL72, and it's not just deployed. It's running workloads on all 72 GPUs in an open rack. Very unique. That took three years of work to get to this point.>> Okay, so now the inside baseball question: how long was the line to get access to it? Was it like everyone going, pick me, I want to run my workload on it? I mean, that's the dream. That's like getting the new car first, right? Come on.>> I'm going to answer a different question. We have a large number of customers who are working with us->> customers, yeah...>> and we're happy to give early access to all of them. We have hundreds of people who want to get their hands on that thing.
>> Well, as you know->> Everyone's your new best friend, right?>> Exactly.
>> This show is amazing. It's like GTC for open systems and GTC was a watershed moment for Dell with Michael in the front row. You were actually there. I don't know, Arun->> I wasn't there.
>> I remember, Ihab, seeing you on the screen. I'm like, "Oh my God, look at that." But that was really a turning point.>> Yeah.
>> I think just perception-wise about Dell, and I think it motivated a lot of people internally. It was a real flashpoint for the company, and now it's just up and away.>> All right, the game starts. What's next? Arun, what are you going to work on now? What's next for you?>> I think Ihab hit upon it; there are three areas we're really focused on. I'll start with networking. I mean, we are going to spend a lot of time on networking. Second, you all see a lot here on cooling, cooling technologies, how to innovate and go faster and go there even more. And then the next generation of AI platforms. These are the three things I'm really working on, in that order. I mean, getting networking right is really important. Getting cooling right is very important. And then we just launched the current generation; we're going to the next generation of AI technologies.>> Awesome.
>> Yeah. New software partners, and more of an easy button for software for customers everywhere. As part of AI factory, it's going to be really exciting by the time we get to Dell Tech World. Number two, I would say networking, but also we're building the most advanced storage servers with very dense media, very good connectivity. We're building storage media specifically for AI with everything we're learning. And then you're going to see us announce more and more big customers and great collaboration.>> Well, we want to talk to them. We're doing our job here on the CUBE factory. We're going to do our digital twin of the SC24 in Palo Alto, and the New York Stock Exchange to see a lot more editorial content come on this topic. We want to showcase the customers that have the use cases, there's a real appetite for what my peers are doing, what are best practices. A lot of choice involved, a lot of architecture being done. So again, congratulations on the great factory model, AI Factory. Thanks for coming on theCUBE. Appreciate it.>> Thank you.
>> Thanks, you guys.
>> Thank you very much.>> The leaders at Dell, making the AI Factory a reality. Of course, the rack-scalable integrated systems are really the key. And of course, you've got to have the GPUs and the machines and the process and the networking. Of course, theCUBE is here, day two. Stay with us for tomorrow, day three. Thanks for joining.