SC24 | James Coomer, DDN


James Coomer

SVP Products

DDN

DDN appliance upgrades boost GPU performance and capacity efficiency for AI processing

With the launch this week of upgrades for its data intelligence platform, DataDirect Networks Inc. has signaled its intent to bring significant scalability and GPU performance to the infrastructure necessary for processing AI workloads. The company’s deep integration with Nvidia Corp. GPUs and the introduction of its next-generation A³I data platform are designed to translate into faster AI training and delivery of real-time insights.

DDN’s James Coomer talks with theCUBE about GPU performance during SC24.

DDN's Big Announcements and Supercomputing 2024 Updates: Advancements in Performance Density, Efficiency, and Data Compression

Exascaler parallel file system and online upgrades

Collaboration with customers like NVIDIA and Swisscom

Prioritizing roadmap based on common customer requests


AI fans at Supercomputing 2024 in Atlanta are learning about DDN's growth in AI technology. DDN has announced enhancements to their data intelligence platform, doubling the density of their AI appliance and supporting over half a million GPUs globally. The new A3I platform is designed for the latest Blackwell GPUs. DDN's technology enables customers to tackle tough AI challenges, improve model training, and manage checkpoints efficiently. They are focusing on HPC and AI, prioritizing developments that align with industry needs. In Scotland, AI technologies li...


Savannah Peterson

>> Good morning, AI fans, and welcome back to beautiful Atlanta, Georgia. We are here on day three of Supercomputing 2024 and the fun is still rolling. My name's Savannah Peterson, so excited to be having these conversations this week and particularly excited about my next conversation with James from DDN. Welcome to the show, James.

James Coomer

>> Hello. Thanks for having me.

Savannah Peterson

>> Yes, absolutely. We're going to unpack a lot. You've had some big announcements this week. We had Alex on the show as well, but before we even get there, how has this show been for you? It's got to be quite a moment for you guys.

James Coomer

>> It is. It's been ramping up to this year for three years, it feels. We had a little bit of AI-ness a few years ago. Now it's gone really massive this year, so everything's gone totally crazy. Our data summit, brand new customers from across the spectrum of enterprise, all experimenting with AI, moving into large scale AI, so a very different feel from previous years.

Savannah Peterson

>> Oh, fun. Yeah. I want to talk to you a little bit about that. And I couldn't agree with you more. When we were in Dallas, it felt like a whisper and we were talking almost as much about quantum as we were about AI.

James Coomer

>> Yeah, exactly.

Savannah Peterson

>> Then, in Denver last year, definitely some velocity, and now it feels like, to your point, it's mainstream. I feel like even our friends and family know what we do now.

James Coomer

>> Yeah. There's so many brand new booths from these AI cloud partners and stuff like that, which we've never seen before. They're all here now. There's 10 or 15 of them right over there.

Savannah Peterson

>> Yeah, yeah. I think there's over 30,000 people here too, which is pretty cool. You had some big announcements this week. Talk to us about those.

James Coomer

>> Yeah. The past few years, we've been seeing the same ramp, right? Huge ramp in AI and those customers come to DDN because we provide this data intelligence platform. And this year, we've been announcing several new developments of that platform and the magic of the platform really is it enhances the stack itself. The stack it's in. The storage, the networking, the compute, the infrastructure. But it also enhances, accelerates and optimizes above the stack, where the applications live, the inferencing, the AI model training. We're making various enhancements to accelerate even further with even less power, and those three announcements are really more around those areas plus simplicity and uptime. Let me go through them.

Savannah Peterson

>> Yes, please. That's a holistic approach and solution is what you just described. Yeah, let's break it down.

James Coomer

>> We were already doing pretty well, so we're already the densest and fastest AI appliance on the planet, and we just doubled that.

Savannah Peterson

>> Casual. And you support over half a million GPUs.

James Coomer

>> Easy.

Savannah Peterson

>> Which-

James Coomer

>> We're right there. We've been ramping up for it.

Savannah Peterson

>> I love it. I love it.

James Coomer

>> Yeah. Our largest customer is 100,000 GPUs, but overall, right now, we're supporting 500,000 GPUs globally and we've just doubled the density of our appliance. The performance efficiency and the capacity efficiency are crucial to our customers. We want to minimize the data center space and power, because that's what we're running out of right now, so with huge data centers, they're struggling for power, struggling for space, and struggling for GPUs. Our job is to, A, help them reduce the amount of infrastructure they spend on our piece and deliver new acceleration and turbocharge to the infrastructure and the applications to get more out of the infrastructure. We see a data platform, not just as a piece of storage, but something that accelerates and enhances the applications. We've doubled the density of our flash systems, we've introduced-

Savannah Peterson

>> That's not a casual... That's not iterative. That's a nice feeling there.

James Coomer

>> And then we also introduced a new A3I platform. A3I, AI data appliances. Six years ago, Nvidia came to us, after looking at many vendors, and they chose DDN to support their first Selene supercomputer, and that was with the AI400.

Savannah Peterson

>> That's a vote of confidence in the earlier days.

James Coomer

>> It was then. But even better, the second generation, they made the A100s, and they came to DDN again and they bought our second generation appliance, the AI400X2, and now we've built the AI400X3, and that's been specifically designed to support the new Blackwell GPUs. Those Blackwell GPUs now have much higher memory bandwidth, much higher memory footprint, and of course, they need a data infrastructure that's going to cope with that huge turbo boost in performance at the GPU level, and that's exactly what we've done. That's number two. Massive performance enhancement, performance density enhancement, efficiency enhancement with the new AI400X3 appliance. And then, there's even more.

Savannah Peterson

>> I was going to say, "But wait, there's more."

James Coomer

>> There's more.

Savannah Peterson

>> One more thing.

James Coomer

>> Our Exascaler parallel file system. This is really what's been driving the largest number of new AI models in the world. We're behind Nvidia's large systems, we're behind xAI's large systems, we're behind a large number of SuperPODs, as the data appliance. And that software is architected really ideally to push data into GPUs. When you're running these models, you need to push data, push the models themselves, and bring the checkpoints down. It's quite a complex activity, but for maximum stability and efficiency, you use this software to do that very efficiently.

Savannah Peterson

>> Yeah, it's like two pedals in a car, essentially.

James Coomer

>> But we are, I think Nvidia put it best actually, we are the fuel for that engine. Data is the fuel for the GPU engine.

Savannah Peterson

>> Oh, I love that analogy. I'm going to borrow that from both of you.

James Coomer

>> It's a nice one. And so, on the Exascaler side... We've done three things. A, we've implemented a scalable data reduction, and that means you get more for less, and you can keep it on, which sounds like a funny thing to say, but with many storage technologies, you can get data reduction, but if you switch it on, in the face of 10,000 GPUs or whatever, it starts going slower, so we built a new way of doing this data reduction, which means we can scale that compute infrastructure and still maintain very, very efficient data compression.

Savannah Peterson

>> How do you do that?

James Coomer

>> We tap into the Linux kernel compression methods and run this compression right next to the application. And that means you're compressing over the wire and in the storage, so it's like double bonus. Much more for your dollar, much more for your power. Second thing we've done is make our customers' lives much easier. We've built a management framework around our parallel file system that allows them to configure, set up, change, manage, control, monitor systems with a brand new set of APIs, much, much simpler than any other parallel file system in the world. And then, finally, but not least, we've implemented online upgrades. The largest supercomputers in the world, AI or HPC, as of today, in fact, as of last August, they are able to keep everything online, applications running, 100%, in the very largest systems whilst we upgrade the software across the data infrastructure, which is a really big challenge for us and we've been-
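
[Editor's note: The idea of compressing "right next to the application" so that the saving applies both on the wire and on storage can be illustrated with a minimal sketch. This is not DDN's implementation: Python's `zlib` stands in for the kernel compression methods mentioned above, and `client_write`, `client_read`, and the sample payload are hypothetical names invented for illustration.]

```python
import zlib

def client_write(payload: bytes, level: int = 6) -> bytes:
    """Compress on the client, before the data leaves the application node."""
    return zlib.compress(payload, level)

def client_read(stored: bytes) -> bytes:
    """Decompress on the client after the compressed block comes back."""
    return zlib.decompress(stored)

# Hypothetical, highly repetitive payload standing in for application data.
payload = b"token_id,embedding_row\n" * 10_000

wire_block = client_write(payload)  # what crosses the network
stored_block = wire_block           # the server keeps the block as received

print(f"application bytes: {len(payload)}")
print(f"bytes on the wire: {len(wire_block)}")
print(f"bytes on storage:  {len(stored_block)}")

# The round trip must be lossless.
assert client_read(stored_block) == payload
```

[Because the block is compressed once at the source and never re-inflated in transit, the same reduction counts twice: less network traffic and less capacity consumed. Real ratios depend entirely on the data; this sample is deliberately repetitive.]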

Savannah Peterson

>> It's a bit of a paradigm shift. Downtime is a part of tech life, generally speaking, and you're telling me you're able to evade that?

James Coomer

>> It has been. Of course, typically, historically, you can do it at small scale. We're not in that scale anymore.

Savannah Peterson

>> I was going to say, not what we're talking about. This is scale. This is the definition of scale.

James Coomer

>> Thousands of GPUs, thousands of CPUs, thousands of network ports, and often hundreds of petabytes of flash storage. Keeping all that online while you upgrade the whole infrastructure is a difficult thing to do. And it's taken us many years, but now we're here. Exascaler 631, released in August. We've done it.

Savannah Peterson

>> That's so impressive. No wonder the most important and notorious companies in this space are powered by you. To your point, there's no surprise, you're fueling big companies like Nvidia. Give me some examples of what this type of technology can empower from a customer-facing or a human-facing application.

James Coomer

>> Our customers, they are performing the toughest AI challenges out there in the world. There's two halves of that story as well. You're either training this model or you're running this model in production, and both are hard and both require data.

Savannah Peterson

>> Lots of data.

James Coomer

>> Yeah. And the training phase has really been happening intensively over the last two or three years. And to do that training, you need to have a large infrastructure, maybe 1000 GPUs, maybe 10,000, our largest customer has 100,000, and you run a model across the memory of all those GPUs, and you run it at multiple epochs. It takes months to train one of these new models that we know, like the ChatGPT style thing, and like Grok, takes months to train. And so, the data needs to be ready to keep those GPUs busy when they ask for data. Otherwise, you might have $100 million, billion dollar data center sitting there just waiting for data. That's the critical thing we need to just make go away entirely. All we're doing is making those GPUs productive 100% of the time, and that's really what we've been doing for our customers. And that's accelerating data loads, accelerating model loads, and managing those checkpoints because large infrastructure fails all the time, and you don't want to get to the last day of the month and nearly be ready and have it fail.

Savannah Peterson

>> Right.

James Coomer

>> Right. You checkpoint regularly, you save that data, so if there is a failure, just restart from the nearest checkpoint. And that's more data moving backwards and forwards, and we're the company that does that most efficiently for our customers.
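
[Editor's note: The checkpoint-and-restart pattern described above can be sketched as a toy training loop. This is a generic illustration, not DDN's or any framework's implementation; `save_checkpoint`, `load_checkpoint`, `train`, and the JSON state are hypothetical, and the "training" step is a placeholder.]

```python
import json
import os
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash during the write leaves the previous good checkpoint untouched.
    os.replace(tmp_path, path)

def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def train(path: str, total_steps: int, every: int, fail_at=None) -> dict:
    """Toy loop: resume from the checkpoint and save every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # placeholder for real work
        if step % every == 0:
            save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=100, every=10, fail_at=57)  # fails partway through
except RuntimeError:
    pass
resumed_from = load_checkpoint(ckpt)["step"]    # last checkpoint before failure
final = train(ckpt, total_steps=100, every=10)  # restart picks up from there
print(resumed_from, final["step"])
```

[The work lost to a failure is bounded by the checkpoint interval, which is why frequent checkpoints are attractive and why the storage system's write speed limits how frequent they can be.]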

Savannah Peterson

>> I would imagine that helps with utilization, time to value, ROI, cost reduction.

James Coomer

>> This is what I was saying at the start. It's like, you choose a data platform vendor and normally you'd think, "I'm just buying some place to put my data," but actually, the impact that data infrastructure has on the rest of the productivity can be 25, 30% more. Just by removing all the I/O wait time, then those GPUs stay busy.
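
[Editor's note: One way to read the "25, 30% more" figure is as simple arithmetic on idle time. The model below is an illustrative assumption, not a DDN measurement: if GPUs spend a fraction f of wall-clock time waiting on I/O, eliminating that wait raises useful work from (1 - f) to 1. The function name `throughput_gain` is hypothetical.]

```python
def throughput_gain(io_wait_fraction: float) -> float:
    """Relative gain in useful GPU work if all I/O wait is removed.

    With a fraction f of time spent waiting, useful work is (1 - f).
    Removing the wait lifts it to 1, a relative gain of 1 / (1 - f) - 1.
    """
    if not 0.0 <= io_wait_fraction < 1.0:
        raise ValueError("io_wait_fraction must be in [0, 1)")
    return 1.0 / (1.0 - io_wait_fraction) - 1.0

for f in (0.10, 0.20, 0.23):
    print(f"I/O wait {f:.0%} -> {throughput_gain(f):.1%} more useful GPU time")
```

[Under this model, a cluster that idles on I/O for about a fifth of the time would gain roughly a quarter more useful GPU time once the wait is gone, which is in the range quoted above.]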

Savannah Peterson

>> That's a huge-

James Coomer

>> It's massive.

Savannah Peterson

>> ...improvement.

James Coomer

>> It's massive. It's massive.

Savannah Peterson

>> And decreasing waste. Oh, my gosh, my mind is just racing. Are there any case studies or stories of this really making impact for your customers that you're able to share with me?

James Coomer

>> A lot.

Savannah Peterson

>> Let's hear them.

James Coomer

>> I mentioned NVIDIA. NVIDIA chose us. And this is now six years ago, I think, with NVIDIA Selene. They were the first to build these very large scale models. It was Megatron-LM at that time. And so, we've been working with them, backwards and forwards, to make this whole process more efficient. Real large scale, real large language model development, and NVIDIA come to us and say, "DDN, we love you, but we see this piece that needs accelerating." And so, we go back into engineering and we develop some more stuff and we send it back to them. And now we've got maybe 30 different items in our historical roadmap, which we've done to accelerate this infrastructure. In the optimization, we're optimizing not only the storage piece, but the network piece, the ports on the DGXs, the NVIDIA systems, the CPU, the GPU, the containers and the AI frameworks themselves and the data path. It's not just storage at the end of the network, it's an integrated data intelligence platform that's reaching right across the network into the applications.

Savannah Peterson

>> It's thoughtful is what it is.

James Coomer

>> There is a lot of intelligence in there. This is really why these customers come to DDN. Swisscom, we also announced that recently. That's a large national infrastructure in Switzerland. They're building a national AI system to enhance the economy of Switzerland, and they chose DDN to accelerate their NVIDIA GPUs.

Savannah Peterson

>> Oh, that's cool. I'm going to have to look into that project.

James Coomer

>> Yeah. But they do it because we are the best in terms of ROI. You spend a little piece on the data and you get big returns back in that whole infrastructure.

Savannah Peterson

>> And probably the ability to continue to work and innovate with agility, with you as a partner in that circumstance.

James Coomer

>> As I said, all we do is HPC and AI. We're not distracted by other things. Our focus is entirely on these customers. And that's why I think we do very well in this world because we listen to these customers all the time. The customers are Nvidia, they're Swisscom, they're the large cloud providers, like Lambda, like Vultr, like Scaleway. They're all DDN customers and they're constantly interacting with us and asking for things and requesting enhancements and stuff. And we listen. We listen a lot and we focus entirely on them and deliver the products that they want us to deliver.

Savannah Peterson

>> How do you prioritize a roadmap? Because I would imagine everybody wants everything from you.

James Coomer

>> Fortunately, people are asking for very similar things right now. We're in this phase of AI where... And this is one of the beauties of AI itself, actually. Think of HPC. We've basically got fluid dynamics, structures, we're modeling cars, we're modeling the environment. Everybody's got different requests. Now, the magic of AI is you can take one multimodal model and it can work with words, it can work with images, it can work with video, so we're concentrating all the effort into a relatively small number of very, very capable models. And the consequence of that, and the consequence of the fact that Nvidia really is the de facto environment, means that everybody is, not exclusively but mostly, asking for the same things. We need to make this whole process, the end-to-end process of developing new models, easier, faster, simpler, accelerated, turbocharged and do it with, of course, fewer dollars. And the data platform is the key area where we can make that happen.

Savannah Peterson

>> That's awesome. And you're right, it does make that easier rather than having just a whole bunch of ad hoc things coming at you. And I think that's actually... That's not necessarily in the spirit of collaboration, but I think it's representative of where we're at and everyone is collaborating and trying to build the same solutions here, so that everyone can go off and change the world with AI and save lives.

James Coomer

>> We see a lot of innovation out there with different customers, so they are doing different things, but it's all in one direction. It's all in one direction.

Savannah Peterson

>> Okay, that's interesting.

James Coomer

>> And so, when we make an enhancement for one major customer who is really doing something very special, we know that's going to have impact elsewhere as the rest of the world matures and sees these new opportunities for efficiency gains.

Savannah Peterson

>> I'm not surprised, especially with how long you've been in the game. I have two final questions for you. One, just because we don't have a lot of guests from Edinburgh, what is the conversation around AI like in Scotland right now?

James Coomer

>> Oh, you don't want to know.

Savannah Peterson

>> I do want to know. That's why I asked.

James Coomer

>> We're going to get into politics.

Savannah Peterson

>> No, it's not necessarily politics. I'm just curious. You're sitting at the pub having a cheeky bite. What's the dialogue like? I live in the Silicon Valley. I'm immersed in it. You can't take a step without hearing someone say GPU. I'm curious if it's similar.

James Coomer

>> It is actually very similar. It's an extremely historic town. The castle's almost 1000 years old, right in the middle there, and you see it everywhere you go. The buildings are all ancient. But still, before I came out here, I met with some friends in the park, it was a bit cold, but anyway, we met in the park anyway.

Savannah Peterson

>> Cold. You're used to it, Scots.

James Coomer

>> We don't have normal conversations anymore. Halfway through the conversation, there'll be maybe something somebody says and it's like, is that true? It's like, "Hey, ChatGPT. Hey, Grok. Just give us answers on this." And then you got a real human answering you like a normal person and you can settle the argument really quickly. It's definitely becoming a part of everyday life.

Savannah Peterson

>> I love that. It's a banter tool.

James Coomer

>> It is. It's an argument settler.

Savannah Peterson

>> Oh, that was a better answer than I was expecting. Not that I expected a poor answer, but that was an interesting answer. Wow, that's fun. I need to start using that more when I'm deliberating, maybe even with our production team now that I'm saying that out loud. Last question for you, because this has been a lot of fun, James, and I knew you were going to be awesome, when we're hanging out, you're obviously seeing massive gains and improvement and innovation quite quickly, I'm impressed by a lot of the accelerated growth you're seeing, when we're sitting at this desk, this time next year, in St. Louis for Supercomputing 2025, what do you hope to be able to say then that you can't yet say today?

James Coomer

>> I think we're going to be able to say that we've enabled, again, a dramatic transformation in the world of AI. As I said, our customers really are doing the most dramatic things. xAI with Grok, massive customer of ours, they've literally built one of the largest AI supercomputers in the world. Now, they're not doing that for nothing. There's going to be some very dramatic things. Remember that ChatGPT moment a year ago or so?

Savannah Peterson

>> Oh, yeah.

James Coomer

>> Really changed everything.

Savannah Peterson

>> Yeah, yeah. Two years ago, we just had a two-year anniversary of ChatGPT.

James Coomer

>> My hope is the same thing happens again. It's probably going to be around multimodal, just bringing in, more comprehensively, all these different forms of data and basically enhancing our overall experience by bringing in sound, audio, video, and making it a more day-to-day interaction. And we think we are a crucial part of that story.

Savannah Peterson

>> I think you're a crucial part of that story as well, James-

James Coomer

>> Thank you.

Savannah Peterson

>> And I look forward to talking about that next year with you. Thank you so much for coming to hang out this morning.

James Coomer

>> Oh, thanks a lot. It's been great.

Savannah Peterson

>> Yeah, this has been a joy. And thank all of you for tuning in. We're here in Atlanta, Georgia, day three of Supercomputing 2024. My name's Savannah Peterson. You're watching theCUBE, the leading source for enterprise tech news.