theCUBE + NYSE Wired: AI Factories - Data Centers of the Future | Anush Elangovan, AMD


Anush Elangovan

VP, AI Software

AMD

In this interview from theCUBE + NYSE Wired: AI Factories – Data Centers of the Future, Anush Elangovan, vice president of AI Software at AMD, joins theCUBE’s John Furrier to unpack how AI factories are turning the modern data center into a supercomputer. Elangovan details AMD’s open, multi-generational platform approach spanning EPYC CPUs, Instinct GPUs, Pensando NICs and the ROCm software stack – built for reliability, security and performance at scale. He shares highlights from OCP’s Helios rack-scale system and why GPU memory capacity and bandwidth are cr...

>> Hello, I'm John Furrier, your host of theCUBE here at our NYSE CUBE Studios. Of course, we have our studio in Palo Alto, connecting Silicon Valley and Wall Street, part of our NYSE Wired series on the AI factories, the future of the data centers. Anush Elangovan is here, VP of AI software at AMD. Welcome to the program, welcome to our series on the future of data centers, which is now large-scale supercomputing. Anush, thanks for coming in.

Anush Elangovan

>> Thank you, John, for having me.

>> So a lot of the conversations in the mainstream are highlighting AMD's rocket-ship past couple of weeks, obviously the OpenAI relationship, Oracle announcement, but also, there's a lot of other hidden gems out there. Not hidden, they're out in the public domain. Open Compute Summit is this week, Helios rack-scale, changing the game there, open standards, there's a lot going on in this build out of these large-scale clusters that are basically enabling the massive growth of AI-native, AI infrastructure applications, agentic, we were also at Dreamforce with theCUBE. You're starting to see the pressure points coming in on the agentic side, which is going to fuel more context windows that are going to need tokens for those agents. Again, the tokens are feeding everybody. So this is a massive shift, it's a generational shift, and the number one topic is, how do I build the platforms to power all these other platforms? So that's the infrastructure. So it's not just hardware, it's software. So you've got the keys to the kingdom over at AMD, tell us, what is the challenge right now that you're working on, what are you focused on? Because this is a critical disruptive enabler for a whole generation of entrepreneurs, software stacks. What does the OS look like? Take us through what's happening.

Anush Elangovan

>> Yeah, yeah, definitely. John, you said it right, it is transformational. I liken it to electricity, it's as big as that, it's like we invented the new electricity and it's going to transform every part of our life. And what we are focused on at AMD is building the compute platform, both at the hardware level and at the software level, and then bringing the solutions so that the transformations that happen at these AI factories and just transformation of entire industries with the enablement of AI is our focus. And to enact that, it all comes down to how you can get the compute required to power this AI infrastructure. And compute, AMD has had a very good, successive generational delivery of hardware, and now we are focusing on the software layer, so that you have a pervasive software layer that goes on top of the hardware layer that can then enable all of these AI factories to be built and transform lives, which is just amazing to watch in real-time.

>> Dave Vellante and I talk about this on our CUBE Pod all the time, every Friday, around AMD, what have they got going on? First of all, I'm like, "Dave, I think AMD is going to have... I think they're hiding the ball a little bit." That's not hidden anymore. Obviously, the success just in the alignment in the ecosystem. Like I said, Open Compute, you guys launched Helios rack, Open Source Week next week is coming with the Linux Foundation. Take us through some of the key market dynamics around what you're working on, because it takes a village. It is like electricity, a lot of people are experimenting. What's the famous Edison quote? "I failed 10,000 times before I figured out how to do electricity."

Anush Elangovan

>> Exactly.

>> It's not that much failure going on now, but there are some stop-starts, there's some success patterns. You're starting to see visibility now. Can you share your thoughts on this evolution? What is working? What are some of the blockers? What are some of the things that need to be reinvented? Take us through, because it's not as obvious as just building a server.

Anush Elangovan

>> Perfect. So I'll answer your question in three pieces. The first one is how we do an open ecosystem, how we participate in open source, open ecosystems. So we started with a clean slate of having a platform that anyone can contribute towards, and so all of our ROCm software is open source. And then, we also have an open ecosystem, so that it's not just the software, we want other companies and constituents to come into our ecosystem and move the ball forward in terms of innovation. So the Open Source Week next week starts with the AMD AI DevDay on Monday in San Francisco, and so we bring all the ROCm developers, AI developers, to one location, share thoughts, best practices, tutorials, and get them to start working on some of these new agentic workflows, et cetera. And then, we have the PyTorch Conference, the Triton Conference, all of which AMD has a very big footprint on, and we are very excited about the Open Source Week next week. We also have the Helios rack that was announced at OCP. Helios is a supercomputer, it is literally a behemoth, both in computing power, and also literally in terms of how much it weighs. But the good thing about Helios is it's a culmination of years of innovative research on hardware and now software, and we want to bring that in a compact form factor... I wouldn't say compact. It's compact enough for a data center, so that you can rack a bunch of these and go to gigawatt-scale. So as you mentioned, as we are now talking about gigawatt-scale deployments, one of these racks is literally a supercomputer. And so, we want to build that hardware layer, with a very robust software layer, in which you can build your AI factories, your industries, your software innovations, on top of AMD hardware and software.

>> On the Helios, I'm just going to read some stats here, I want to ask a question. You've got the MI450 GPU deployed as part of the Helios system, with access to up to 432 gigabytes of HBM4 memory, a total bandwidth of 19 terabytes per second, 72 of those GPUs in each system, can deliver 1.4 exaFLOPs of FP8 performance, with 31 terabytes of HBM4 memory overall. Again, you mentioned the multi-generational investment. And then, Oracle had a term called engineered systems back in the day, you have engineered a supercomputer. So in the OCP standards, all good, power and cooling fits in, that's some serious horsepower. Now, you've got the GPUs, you've got the HBM, you've got the networking, you've got to put them all together. This is system software, so I have to ask you the question, this is where the innovation right now is mostly in, what is the software component of that? Because you've got to connect all those, you've got to network them. I've said networking is the operating system for these large systems, because you've got to coordinate memory. But it reminds me of the '90s, when I was in school, you had to write software for memory management. Remember the old days? Swapping out the disk and doing those things. I'm dating myself. Actually, it was the '80s, actually, the '90s too, but both. But now, you have a similar system/software/engineering challenge. Take us through, because this is super important, because it's got a scale-out, scale-up impact, and it also will impact the performance of the actual jewels, which is the GPUs and the compute. So take us through the AMD differentiation, how you guys are thinking about that software, those fabrics, you've got network fabric, you've got storage fabrics, and you've got high-performance SSDs coming soon. I did a whole thing on Silicon last week, that's coming too. So it's mind-blowing. Where's the software change? What's the system look like? Can you share your thoughts and unpack that for us?
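As a rough check on the rack-level figures quoted in this question, the short Python sketch below multiplies the per-GPU memory by the GPU count and divides the rack compute figure across the GPUs. The constants come from the numbers read out in the interview; the use of decimal units (1 TB = 1000 GB) and the function names are assumptions made for illustration, not anything AMD publishes in this form.

```python
# Figures as quoted in the interview; decimal units are an assumption.
GPUS_PER_RACK = 72          # GPUs in one Helios system
HBM_PER_GPU_GB = 432        # HBM4 capacity per GPU, in gigabytes
RACK_EXAFLOPS = 1.4         # quoted rack-level compute figure


def rack_hbm_tb(gpus=GPUS_PER_RACK, hbm_gb=HBM_PER_GPU_GB):
    """Total HBM across the rack, in terabytes (1 TB = 1000 GB)."""
    return gpus * hbm_gb / 1000


def per_gpu_petaflops(rack_exaflops=RACK_EXAFLOPS, gpus=GPUS_PER_RACK):
    """Rack compute divided evenly across GPUs, in petaFLOPs."""
    return rack_exaflops * 1000 / gpus


if __name__ == "__main__":
    print(f"Rack HBM: {rack_hbm_tb():.1f} TB")
    print(f"Per-GPU compute: {per_gpu_petaflops():.1f} petaFLOPs")
```

The memory total lands at roughly 31 TB, which agrees with the "31 terabytes overall" figure quoted above.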

Anush Elangovan

>> Yep. It's a very good question, and like you said, it is a networked computer. We have AMD EPYC processors, we have the AMD Instinct GPUs, we have the AMD Pensando NICs, and most importantly, we have the AMD software, the ROCm software, on top of it to stitch all of these pieces together to build a supercomputer. And it is a supercomputer in the traditional sense, because the stats that you just mentioned, like 432 gigabytes of HBM per GPU, and we have 72 of these GPUs, that is just-

>> Unbelievable.

Anush Elangovan

>> Yeah. That was the size of the internet in the '90s in one GPU. And it's not just that it is... It is actually being processed and being used to give you the reasoning power and the capabilities that you're getting used to with ChatGPT and other frontier models, because all of these models are in memory, and AMD has had a significant bandwidth advantage and a memory capacity advantage. Even in the current generation, if you look at what is mostly deployed on AMD platforms, we are like 1x to 1.5x larger memory, which means your models can be, instead of doing a 70 billion Llama, you can do a 120 billion gpt-oss model, and the difference between a Llama 70 billion and a gpt-oss 120 billion is immense. And so, the focus for us has been to get these large models in. But to your point on the software, I'll date myself a little bit too, I used to look up on Tandem NonStop OS, and I don't even remember them from Tandem and Compaq days. But the ability to keep the systems running while you're still being performant and you gracefully degrade is all something that you can't just bolt on at the end. It is not a, oh, I'm just going to install a little package and it's going to do something. You've got to go first principles on everything from the firmware to your PCIe root complexes, your GPUs, your CPUs, and if there is a failure in one of these cases, then how do you recover from it? How do you ensure that a large service, such as ChatGPT or such, doesn't get affected because there's a cluster that went down, or a node that went down? So reliability is built into the platform, security is built into the platform, performance is built into the platform, and most importantly, we are making it a robust platform that is going to be multi-generational. So Helios is just the start, and you've seen all the market news on the adoption of the 450 platform, but that is just the beginning of what we think will be an exponential growth trajectory for AMD systems and solutions.
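The link between memory capacity and model size in this answer can be illustrated with a back-of-the-envelope estimate: weight memory is roughly the parameter count times the bytes stored per parameter. The sketch below is only that estimate. It assumes 16-bit weights (2 bytes per parameter), ignores KV cache and activations, and reserves an arbitrary 20% of HBM as headroom; the 192 GB device is a hypothetical comparison point, not a figure from the interview.

```python
def weights_gb(params_billion, bytes_per_param=2.0):
    """Approximate weight memory in GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param


def fits(params_billion, hbm_gb, bytes_per_param=2.0, headroom=0.8):
    """True if the weights fit in the assumed usable fraction of one GPU's HBM."""
    return weights_gb(params_billion, bytes_per_param) <= hbm_gb * headroom


if __name__ == "__main__":
    # 432 GB is the per-GPU figure quoted in the interview;
    # 192 GB is a hypothetical smaller device for comparison.
    for hbm_gb in (192, 432):
        for params in (70, 120):
            verdict = "fits" if fits(params, hbm_gb) else "does not fit"
            print(f"{params}B params at 16-bit on {hbm_gb} GB: {verdict}")
```

Under these assumptions a 70-billion-parameter model needs about 140 GB for weights and a 120-billion-parameter model about 240 GB, so the larger model only fits on a single device once capacity grows well past the smaller hypothetical size. Lower-precision weight formats would shrink both numbers.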

>> All right. So my next question is, you're starting to see the segmentation on the customer base, you've got the mega, I mean mega in large, not megawatts, I should call it giga-

Anush Elangovan

>> Giga...

>> giga factories, because you've got the big hyperscalers, you guys partnered with Oracle, OpenAI and others, and then you've got the enterprise. Enterprise might not need the giga or mega version, they're going to need to have a rack dropped in. There's a real question around, how do I change my storage strategy, my networking stuff? Obviously, you've got the racks. But it seems like, in the enterprise, each customer is like a snowflake, it's going to be different. How do you view that and what's that mean for the enterprise, who are going to lag a little bit behind some of the big hyperscalers, because they have the demand? And again, it's really not a competitive question, because the demand for systems is so high, everybody's winning.

Anush Elangovan

>> Exactly.

>> So the question really is around, how do I get these things in a steady state in the enterprise? Because they have data centers, there's a lot of on-premise activity for processing their data, and the edge is right around the corner. You can do inference on the edge today, but training's going to be needed on the edge too soon. So you're starting to see the dots connecting, it goes from the big clouds, hyperscalers, neoclouds, to now enterprise and then edge. This is distributed computing 101.

Anush Elangovan

>> Exactly.

>> What is the answer for the enterprise? Is it custom work? Do the standards solve that? Take us through your thoughts and vision on that, because that's a big part of the market right now, we think on the growth side.

Anush Elangovan

>> Yeah, yeah, 100%. So the way I think about it is it's one cohesive pervasive platform. It starts with the gigawatt-scale deployments, and those are usually bespoke, they have their requirements, they want to optimize to the electron-level, because there's millions of users and you optimize that and you get a payback for that. But our philosophy is to build for the entire market. So we want to be able to invest heavily so that it is a fully validated solution that can be deployed at MDCs, but on-prem, and all the way down to client. In fact, I have a Strix Halo, the Ryzen Max 395+ laptop right here next to me, and I use that for my common LLM development and summarization, utilization, all of that, AI on the edge. But to your question on enterprise, we want to make it easy to consume. We want to be able to bring the innovation that's happening at the giga-scale, packaged, validated, and in an a la carte menu for enterprises to consume, so that each of the, quote, unquote, "snowflakes" that you mentioned should be able to take what best fits them and be able to construct them and make sure that the solution is still viable for them and has the ability to participate in this AI revolution. So our software strategy is pervasive, as much as our hardware strategy, and the ability to focus across all of these verticals is super important for us.

>> Okay, so I hear that. I want to ask the next question, which is, okay, you got developers, you've got the Open Source Dev Week next week at the Linux Foundation. Props to the Linux Foundation, they do great work, we love those guys, great group over there. Now, you've got the enterprise has now the supercomputing capability coming. What do you run on that? Because you've got the developers on a feeding frenzy right now looking to get engaged. What stack runs on that? Because the enterprise game is about the ecosystem, they have pre-existing stuff, they've got Dell, they've got HPE servers laying around, they've got top-of-rack switches, they've got all kinds of old gear, and now they're going to drop in a supercomputer. What do you run on it? You don't just load Linux on a supercomputer, because Linux is everywhere. So in the old days, you get a server, you load Linux on it... Half the folks in the market have never even loaded Linux on servers. So what is the software stack? Because startups are coming out with technology that they want to sell to the enterprise. Do they sit in somewhere? So I'm sure ecosystem's on your mind, what is the AMD ecosystem and what's the developer strategy to support that?

Anush Elangovan

>> Yeah, that's a very, very good question. I think the way we approach it is that you should have no code changes, or any changes at all, for you to be able to drop in AMD into any of your existing footprint. If you look at the CPU environment, AMD already has a very, very large footprint in the enterprise, and as part of that, now it's just incremental. We already have these enterprise engagements, and now we say, "Here, drop in the GPU, and you will get these AI capabilities that you can continue to consume." So the way we approach it is we want zero friction, from a software standpoint, for the enterprises to adopt AI, and that is a goal for us to make sure that it's very easy for you to deploy. And in terms of Linux, we have made our AMD ROCm stack available both on Linux and Windows, and the Windows is in preview right now, but by the end of the year, you'll have it in production, so that even on Windows systems, you have the same capabilities to consume compute that is backed by AMD GPUs.
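One place where the "no code changes" goal is visible today is PyTorch: ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` API that CUDA builds use, and set `torch.version.hip` to mark the backend. The sketch below is a minimal illustration of that, not AMD's own tooling; the helper name `describe_backend` is hypothetical, and it degrades to a plain message when PyTorch or a GPU is absent.

```python
def describe_backend():
    """Report which GPU backend PyTorch sees, using the same calls on ROCm and CUDA builds."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # The same availability check is used on both ROCm and CUDA builds.
    if not torch.cuda.is_available():
        return "no GPU visible to PyTorch"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip}" if hip else f"CUDA {torch.version.cuda}"
    return f"{torch.cuda.get_device_name(0)} via {backend}"


if __name__ == "__main__":
    print(describe_backend())
```

Application code that only asks for the `"cuda"` device therefore runs unmodified on either build; only a diagnostic like this one needs to look at which backend is underneath.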

>> So what are you guys going to be doing at the Developer AMD Day in San Francisco next Monday? What's the agenda? What's going to be out there? Put a plug in for the developers watching.

Anush Elangovan

>> Oh, yeah, definitely. So AMD AI Day is focused on developers, and it's an opportunity for us to hear from customers, very established research institutions, and then just developers in general, and try to get them a place to share their thoughts, understand what is top of mind, help them walk through the... If there are any problems, any issues. But more importantly, get the community together and celebrate what we've achieved so far, and then see what we can do in the future. But also, we'll have a few surprise announcements there and good surprise speakers that will be there for the AI DevDay.

>> Make sure we get that news on SiliconANGLE, for sure. And also, there's other events going on for the rest of the year, and obviously going into next year. You've got MWC right around the corner, which is one of my favorite events, we'll be there too. You've got a lot of stuff going on. What should we pay attention to? Where are you guys going to be at? Of course, we want to follow the developers in the ecosystem specifically, we'd like to do a drill-down on that, if you don't mind, at another time. But what's the key events that we should pay attention to for AMD right now?

Anush Elangovan

>> Yeah. So obviously, the AI DevDay next week, October 20th, please come by if you're in San Francisco or dial in. We will be at Supercomputing, CES, obviously, like you said, MWC. And then, going into the later part of the year, we have our own Advancing AI event, so that'll be where we also continue to talk about our current hardware and then what's in the pipeline.

>> All right. Well, we're going to have to get our events schedule retweaked, because we want to be at those events. Anush, thank you so much for sharing. I'm sure we'll follow up with you, a lot to unpack here. Again, congratulations, the past few weeks, the unveiling of AMD's bringing out the goods and doing the deals, setting the table for this new generation. Thanks for coming on and sharing.

Anush Elangovan

>> Perfect. Thank you for having me, John.

>> Great conversation. AMD senior leaders on the software side breaking down the future of the data center, and how these large-scale clusters are becoming the supercomputer. The data center is the supercomputer. We are now actually in the supercomputer era for the first time, even though supercomputing events have been around since 1988. We're going to be covering it, of course, like a blanket. I'm John Furrier, your host of theCUBE. Thanks for watching.