theCUBE + NYSE Wired: Zero Trust Cyber Series | Ramine Roane, AMD

Ramine Roane

CVP of AI

AMD

The discussion between Dave Vellante and Ramine Roane focuses on AMD's role in the AI market, highlighting their latest GPUs optimized for AI workloads. AMD has seen significant growth and adoption, with some of the largest supercomputers using their GPUs. Despite competition from NVIDIA, AMD's focus on high bandwidth memory and advanced packaging technology gives them a competitive edge. The conversation also touches on systems design and networking in AI infrastructure. AMD's recent acquisition of ZT Systems showcases their commitment to provid...

Dave Vellante

>> Hi, everybody. Welcome back to Media Week. My name is Dave Vellante and I'm here with John Furrier, who's also in the house. This is NYSE Wired and theCUBE's week-long production, Media Week, featuring cyber and AI innovators. And one of the key innovators is where it all starts, in silicon. AMD is a company that is at the heart of the innovation in AI. Ramine Roane is here, he's the corporate vice president of AI at AMD. Thanks so much for taking some time with us and coming into theCUBE.

Ramine Roane

>> Thank you very much.

Dave Vellante

>> So, the proverb: may you live in interesting times. We sure do live in interesting times. We're now entering a wave. I've been around long enough to see the mainframe-to-PC wave, the internet wave, the mobile wave, the cloud wave, the data waves. And this one, it seems everybody agrees, almost everybody, is bigger than any of them. You guys have an amazing story. When you finally got rid of the foundry, your business took off. You have been dominating in the x86 market, gaining share, but that x86 market is giving way to the AI market and the whole GPU wave. So, where are we at with AMD from the corporate vice president's perspective?

Ramine Roane

>> Sure, that's a very good observation. The AI market right now is really taking over and dwarfing all the other markets for compute. And AMD actually has had a leadership position in GPUs, but not specifically focused on AI in the past, more focused on HPC. We do have, for example, the number one and number two biggest, most powerful supercomputers in the world. The one that was number one up until a month or two ago was Frontier, which is built by HPE.

Dave Vellante

>> HPE, yeah.

Ramine Roane

>> Which is based on our previous version of GPU, the MI250X. And it was the first exascale computer, exascale meaning 10 to the power of 18. It's a billion billion operations per second.

Dave Vellante

>> A lot of zeros.

Ramine Roane

>> A lot of zeros. And we're not talking AI operations. It's not eight bits or four bits, right? It's 64 bits, a billion billion 64-bit operations per second. But then we decided to go into AI about a couple of years ago, and our latest GPU, the MI300 series, is actually optimized for AI. We started to sell MI300X this year in January, and we are already generating $5 billion this year; I think that's the official number. The most complex AI workloads on the planet are working on MI300X right now. GPT-4, that was announced by Satya Nadella at Microsoft Build, their developer conference, but also Meta announced that they're serving Llama 3 405B, their biggest Llama, exclusively on AMD because we have a massive memory advantage against our competitor, NVIDIA. We have a 2.4X HBM memory advantage because of some technology, advanced packaging technology, that we leverage from the EPYC side of our world. And having a lot of HBM is very favorable for large language models, and that's why they're using that. And we are also running recommender systems, search and recommendation, at other hyperscalers.

Dave Vellante

>> A lot to unpack here. So yes, I think Lisa has put the $5 billion number in your quarterly earnings statements. She's also put forth a TAM number, which is enormous. And I think that's the thing that people, and I'd love your feedback on this, is they don't seem to understand it's not a zero-sum game. The markets now are so big. It kind of used to be one company, Oracle wins the database war, everybody else picks up the scraps. But it's not like that these days. It's not that way in cloud. It's certainly not that way in software. It's not that way in semiconductors and it's even, the market's getting bigger and bigger. And so I'd like to ask you, explain the advantage that you see having from the EPYC architecture. Obviously memory is super important. NVIDIA obviously has a big lead. They've had some other advantages and I'd like to understand how you're going to close the gap, but from a hardware architecture standpoint, maybe you could describe that for us.

Ramine Roane

>> I would say that on the silicon and advanced packaging side, AMD does have the lead right now in that we've been building chipletized devices for a while on EPYC. Actually our big compute density and also power advantage came in when we started to chipletize the device into compute dies. And we also build versions where we have memory on top. So it's like a 3D V-Cache, that's how we call it, but it's really 3D on 2.5D. And that's the type of technology we used for our Instinct GPUs where we have an IO die on the bottom that has all the IOs and the memory hierarchy, and we have the GPU compute dies on the top, basically packing things in 3D, which our competitor is not doing. That gives us more room in the package to put a lot more HBM memory. That's how we get the -

Dave Vellante

>> This is a TSM process?

Ramine Roane

>> It's TSMC. We both use TSMC.

Dave Vellante

>> Right. Understood. But like you said, they're not doing the 3D. They had to sort of dial back down the latest packaging just to get the thing working and get out there. So you're going right after their Achilles heel, which is power. They're expensive, they run hot.

Ramine Roane

>> That's right, that's right.

Dave Vellante

>> But let me flip that and get your perspective on this, because you've chipletized. What they would say, what some would say, okay, but they share a big SRAM which allows all the GPUs to talk to each other and synchronize, and it's synchronous. Whereas the chiplets have to go through memory and it's asynchronous. But you would say there's a large memory. Help us understand that. I think this is in the weeds and it maybe doesn't matter, but from an architecture standpoint, this is like religion. I mean, you have to pick a bet, you have to double down on that architecture. And you guys have clearly doubled down, as have most, on the chiplet architecture. And we know large SRAMs are kind of running out of gas because... I say run out of gas. They don't scale as well because they're taking up higher and higher percentages of the real estate. So that is coming to an end. But help us understand, you know what I'm talking about here and you can add more color because you're much more technical than I am. Help us understand the truth here.

Ramine Roane

>> So there are companies who are building AI chips with only SRAM, no HBM memory. I mean-

Dave Vellante

>> Apple would be an example, right?

Ramine Roane

>> Yes.

Dave Vellante

>> But different market.

Ramine Roane

>> I was thinking on the data center side, I was thinking of the likes of Cerebras or Groq. But those obviously don't have the memory density. So LLMs in particular, they're very big, like Llama 70B. For example, if you're working in eight bits, that would be 70 gig of memory. If you're working in 16 bits, it's twice that, just to fit in the GPU memory. You have to be able to fit it. So when you're working with SRAM, you don't have that kind of capacity. So to partition something as small as a Llama 70B, you will need hundreds of chips if you're only using SRAM; that's what those vendors do. Right now, they give that away for extremely cheap and it runs fast, but it's not something that's sustainable. It's great, it's super fast. It's not sustainable because the cost is so massive-

Dave Vellante

>> The economics don't work.

Ramine Roane

>> The economics don't work.
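
A rough sketch of the memory arithmetic in this exchange, written in Python for illustration only; it was not shown in the interview. It counts model weights only (no KV cache or activations), and the per-chip SRAM capacity `SRAM_PER_CHIP_GB` is a hypothetical placeholder, not a figure from the conversation.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def chips_needed(model_gb: float, memory_per_chip_gb: float) -> int:
    """Minimum number of chips whose combined memory can hold the weights."""
    return math.ceil(model_gb / memory_per_chip_gb)


# Hypothetical assumption: an SRAM-only accelerator with roughly 0.23 GB on chip.
SRAM_PER_CHIP_GB = 0.23

llama_70b_8bit_gb = weight_memory_gb(70, 8)    # 8-bit weights
llama_70b_16bit_gb = weight_memory_gb(70, 16)  # 16-bit weights
sram_only_chips = chips_needed(llama_70b_8bit_gb, SRAM_PER_CHIP_GB)

print(f"70B @ 8-bit:  {llama_70b_8bit_gb:.0f} GB")
print(f"70B @ 16-bit: {llama_70b_16bit_gb:.0f} GB")
print(f"SRAM-only chips for the 8-bit weights: {sram_only_chips}")
```

Under these assumptions a 70B-parameter model needs about 70 GB at 8 bits and twice that at 16 bits, and an SRAM-only design lands in the hundreds of chips, which is the order of magnitude described above.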

Dave Vellante

>> But in the case of NVIDIA, they're having to push their networking in new directions to accommodate this.

Ramine Roane

>> And they use these HBMs.

Dave Vellante

>> Yes, of course. Right.

Ramine Roane

>> So they have quite a bit. On the H100, they had 80 gigabytes, so the 70B would fit in that. We have 192.

Dave Vellante

>> Right, so you have an advantage there.

Ramine Roane

>> You can fit much bigger LLMs. Of course, you can always shard an LLM across multiple devices, but-

Dave Vellante

>> That has trade-offs.

Ramine Roane

>> we can host a much bigger LLM in a single GPU.

Dave Vellante

>> Which means more efficient-

Ramine Roane

>> More efficient-

Dave Vellante

>> and it's all across.

Ramine Roane

>> Cost effective.

Dave Vellante

>> Price performance, better power profile.

Ramine Roane

>> Better power profile. For example, if you're serving a Llama 3 405B, we can do that in one server which has eight GPUs. You would need two of those with the H100 or even the H200.
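
The server-count claim can be sanity-checked with a short Python sketch. This is illustrative arithmetic, not something presented in the interview. It uses only the capacities quoted in the conversation (192 GB per MI300X, 80 GB per H100, eight GPUs per server) and assumes 16-bit weights with no allowance for KV cache or activations, so real deployments need more headroom than this lower bound suggests.

```python
import math

GPUS_PER_SERVER = 8        # from the conversation: one server has eight GPUs
BYTES_PER_WEIGHT = 2       # assumption: 16-bit weights


def servers_needed(params_billions: float, hbm_per_gpu_gb: float) -> int:
    """Lower bound on servers required to hold the weights in HBM."""
    weights_gb = params_billions * BYTES_PER_WEIGHT   # billions of params * bytes = GB
    server_hbm_gb = hbm_per_gpu_gb * GPUS_PER_SERVER
    return math.ceil(weights_gb / server_hbm_gb)


mi300x_servers = servers_needed(405, 192)  # 810 GB of weights vs 1536 GB per server
h100_servers = servers_needed(405, 80)     # 810 GB of weights vs 640 GB per server
hbm_ratio = 192 / 80                       # the "2.4X" memory advantage quoted earlier

print(f"MI300X servers: {mi300x_servers}, H100 servers: {h100_servers}")
print(f"HBM capacity ratio: {hbm_ratio:.1f}x")
```

The weights-only bound reproduces the one-server versus two-server comparison for the H100. The H200 capacity is not stated in the conversation, so it is left out of the sketch.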

Dave Vellante

>> How about the surrounding infrastructure? How big is NVIDIA's moat from your perspective?

Ramine Roane

>> So, most people think of CUDA when they think of the biggest moat. And it turns out CUDA is actually not a moat at all, because in terms of programming language, it's C++ and we have the same, we're pretty much CUDA compatible. If you write code in HIP, H-I-P, which is our CUDA, it'll compile in the NVIDIA compiler and run on NVIDIA hardware, as well as compiling in ROCm, our software, and running on AMD hardware. Now-

Dave Vellante

>> So you've been able to neutralize that moat through code?

Ramine Roane

>> But then people would say yes, but they have a lot of libraries. They've been writing libraries for 15 years.

Dave Vellante

>> What about that?

Ramine Roane

>> So the thing is those libraries do run on the next GPU from an older version to a newer version of the GPU, but they don't run at performance. So all these libraries have to be rewritten at every generation. And right now we're both at a cadence of building a new architecture every year.

Dave Vellante

>> Every year.

Ramine Roane

>> So it's a completely new architecture every two years and an incremental upgrade every year. So every year we have to actually rewrite the libraries so they run optimally.

Dave Vellante

>> Oh, interesting.

Ramine Roane

>> So that's why there is no moat.

Dave Vellante

>> We don't hear that fine print on the earnings call, but they do make... NVIDIA, Jensen makes a big deal out of the backward compatibility, the forward compatibility, but there's work that has to be done.

Ramine Roane

>> There's work. So obviously they're very good at it, they're very fast at it-

Dave Vellante

>> Of course.

Ramine Roane

>> but it's not as much of a moat as people think because everything has to be redone.

Dave Vellante

>> So that's great context. I would say the other moat is the systems expertise.

Ramine Roane

>> That's correct.

Dave Vellante

>> You guys acquired ZT Systems, which I thought was a brilliant move, and it's in recognition that this is a systems game. It's not just about the chip and making the chip go faster. Yeah, that's part of it, but there's other... So why is systems design so important? Why is the ZT Systems acquisition relevant?

Ramine Roane

>> So, right now with the current generation of GPUs, most cloud service providers buy servers that they put together in a rack, and they build a cluster and so on. In the future, NVIDIA is trying to sell racks, like 72 GPUs, and we're going that way as well, so we'll have both servers and rack scale. ZT Systems basically provided clusters, built clusters for hyperscalers, like the biggest of the hyperscalers. So they do have this expertise. They're among the best in the world. So because in our roadmap we have our next, next generation GPU that will have that rack scale capability, we wanted to bring in the expertise from ZT Systems to build those products and make sure that they have all the features that hyperscalers are looking for.

Dave Vellante

>> What should we know about networking? Well, networking within a system on chip and inter-networking across clusters. What should we know about that? I mean, obviously the Mellanox acquisition catapulted NVIDIA's business, but Ethernet certainly, we were at re:Invent last week, I guess two weeks ago because we're publishing this next week. All the hyperscalers are using Ethernet as their backend. It's a standard.

Ramine Roane

>> .

Dave Vellante

>> It runs the internet. Don't bet against Ethernet is probably a good bet.

Ramine Roane

>> That's right.

Dave Vellante

>> What should we know about networking? What's AMD's point of view on this?

Ramine Roane

>> So NVIDIA uses NVLink to interconnect GPUs locally, and then as you said, they have Mellanox with InfiniBand to scale out. We have what we call the Infinity Fabric to interconnect our eight GPUs, for example, in a server. And we are going with Ethernet, and Ultra Ethernet is coming, which has better metrics than InfiniBand.

Dave Vellante

>> Even NVIDIA is going Ethernet.

Ramine Roane

>> Even NVIDIA is going Ultra Ethernet. Exactly. And a protocol like RoCE v2, for example, today with regular Ethernet is actually pretty good as it is. Ultra Ethernet will be significantly higher, and obviously Azure is deploying us, Meta is deploying us, other hyperscalers are deploying us. They all use Ethernet and are looking forward to running even faster with Ultra Ethernet.

Dave Vellante

>> So, it really comes back to this market is so huge. I also want to tap your brain on training and inference. I think we, years ago, even before gen AI hit the market, said that we felt like inference was going to be the big market with the most use cases, especially when you start thinking about edge applications. How should we be thinking about inference? It's funny, NVIDIA says that a very large portion, I forget the exact number, maybe close to half of its enterprise business is inference. I've always assumed that's running ChatGPT queries. So okay, that's inference. But there's other inference, as I say, out at the edge. How do you think about training versus inference? Are they simpatico? Are they different markets? What kind of architectures are required? Obviously there are cost considerations. Help us squint through that.

Ramine Roane

>> Yeah, initially we thought inference would be significantly bigger than training, and it kind of will be, right? Makes sense because you train a model once and then you infer it at scale. But models are becoming so big right now that training, the latest, greatest LLMs, the massive LLMs are taking a lot of resources, a lot of training resources. So training is still pretty big.

Dave Vellante

>> Sucking up a lot of demand.

Ramine Roane

>> Yeah, in terms of compute resources. But definitely inference is getting bigger. It is getting bigger. As far as our GPU goes, our GPU does both, just like NVIDIA's, but we decided to go primarily with inference first, so the ChatGPTs and Llama 405Bs. And the recommender systems, they're all on the inference side. We have started to work with some customers in training and we are seeing near linear scaling when you go from one GPU to hundreds to thousands of GPUs. So our GPU can really do both. Now whether we are going to have inference-specialized GPUs versus training-specialized GPUs, in the future it's a possibility. But right now the intersection is pretty big in terms of what you need.

Dave Vellante

>> I wonder when you mentioned, you said scaling, you're talking about the GPU scaling, but there's also the scaling laws. And I want to ask you about this because the economics of training large language models is terrible. I mean, I think it was $190 million to train Gemini Ultra. Now of course they had a big context window, I understand that. What Elon's building with Colossus, if it's really a hundred thousand GPUs, it's got to be billions of dollars to train. And so maybe if you're not familiar with the scaling laws, it's really just the bumper sticker is you've got to have compute, you've got to have data, and you've got to have parameters, which is the weights and the biases. You've got to have all three in order to scale. If you just try to scale compute without the data or the parameters, you run into diminishing returns. And so we think that whatever Grok is doing in Memphis will prove or disprove this next generation of scaling laws. We'll see. We'll see. But irrespective of that, the economics are very expensive, the costs are very expensive. The price per token has come down, in GPT-3, GPT-3.5, it's come down four orders of magnitude in three or four years. GPT-4 is on a path to come down two orders of magnitude in 24 months. So, cost is going through the roof, price is coming down by orders of magnitude. I mean, really, really tough economics. Now, if you're Facebook or Google, you can sell ads. Bigger GPU clusters mean you can sell more ads, I guess. Okay, so the economics for consumer, maybe not so bad, but enterprise AI, different story. You're going to have to show ROI there. And that's really the market that ultimately you're kind of going after, I would think, with inference: more cost-effective, lower power, very, very high volume. What are your thoughts on all that? See, I see them as two different markets.
The big LLMs are sucking up all the demand right now, and that's wonderful, but it's going to be lumpy; at some point that's going to end. But the real interesting opportunity when you think about NYSE and all the companies in this financial district is really enterprise AI. What are your thoughts on all that?
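
To put the quoted price declines on a common footing, the Python sketch below converts "N orders of magnitude over Y years" into an average yearly reduction factor. It is illustrative arithmetic based only on the round figures mentioned in the question, assuming a constant exponential decline; the function name is hypothetical and the figures are not verified pricing data.

```python
def yearly_price_drop(orders_of_magnitude: float, years: float) -> float:
    """Average factor by which price falls each year, assuming a steady exponential decline."""
    return 10 ** (orders_of_magnitude / years)


# "Four orders of magnitude in three or four years" (GPT-3 / GPT-3.5 class pricing)
gpt3_fast = yearly_price_drop(4, 3)   # if it took three years
gpt3_slow = yearly_price_drop(4, 4)   # if it took four years

# "Two orders of magnitude in 24 months" (GPT-4 class pricing)
gpt4_path = yearly_price_drop(2, 2)

print(f"GPT-3 class: {gpt3_slow:.0f}x to {gpt3_fast:.1f}x cheaper per year")
print(f"GPT-4 class: {gpt4_path:.0f}x cheaper per year")
```

Read this way, both trends amount to roughly a tenfold or greater price reduction per year, which is the pressure on economics being described.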

Ramine Roane

>> There's a lot to unpack there. So first about the scaling laws. I mean, are we running out of data, for example? Yeah, I think a lot of the data which is publicly available has been used already to train those massive LLMs. Of course, there is a lot more private data out there. That's one source of data.

Dave Vellante

>> If I could interrupt, sorry. But because people say, "Well, the LLMs, they could run synthetic data," but that's fine. But that synthetic data is different from that private data. That private data is never going to get into those LLMs, I hope.

Ramine Roane

>> Right, right.

Dave Vellante

>> So sorry to disrupt your thinking.

Ramine Roane

>> But beyond synthetic data, when I'm talking data here, right now it's text only, right?

Dave Vellante

>> Right.

Ramine Roane

>> So fine, let's say we trained the latest GPT on almost all of the text ever written by humanity which is public. However, those models are now multimodal. So they also learn from images, and if you take the case of Sora, it's also learning from videos. So now the source of new data is almost infinite.

Dave Vellante

>> Yeah, okay.

Ramine Roane

>> Really, when you think about it, GPT-3, for example, was trained on a lot of text, more text than you and I can read in the next hundred thousand years or million years. But probably when we were five years old or nine years old, we could do a lot more than GPT-3 because we learned from vision, from seeing things, from experimenting with the laws of physics and breaking things, right? GPT never experienced these things. With video, now it can.

Dave Vellante

>> I see. Yeah.

Ramine Roane

>> And I think it's going to keep on going. There's going to be touch, there's going to be hearing and smelling, which is analyzing the molecules in the air. So there is an awful lot more data to train on.

Dave Vellante

>> All those human senses are going to get digitized.

Ramine Roane

>> All those human senses, yeah.

Dave Vellante

>> That's a really... I had not thought about that deeply. Okay, so we're not going to run out of data.

Ramine Roane

>> And I believe that's really what you need in order to get to what they call AGI. You need to be able to learn from all the different senses. Of course, the one thing the LLMs will never have are the hormonal responses, but everything else at least should be available to one LLM.

Dave Vellante

>> I have to ask, where are you personally? Don't make this AMD's opinion, but some pundits and experts feel like we'll have AGI before the end of the decade. Others say, no way. Do you have an opinion?

Ramine Roane

>> Okay, so again, that's not AMD's opinion. I think in the next 10 years it's possible, but we have to define AGI. I think it's really having the breadth of learning new tricks that humans have. For example, when you're 16 or even 14, I can teach you how to drive a car in a few hours, right? Especially if there's no stick shift, it's even easier. I can show a 14-year-old how to drive a car for a few hours and then put him in the driver's seat and he will drive. No AI really can do that today. If I put the latest, greatest LLM in a robot and say, "Hey, I'm going to show you how I drive, and now you're going to drive in New York City," we're going to have a crash, a really bad crash. So are we going to get there in 10 years, where I can put an AI in a robot's brain and it can learn how to drive a car even though it wasn't trained for it?

Dave Vellante

>> Maybe.

Ramine Roane

>> Conceivable. Yeah, but we will get there, it's just a matter of when.

Dave Vellante

>> The technology exists. And if you look at the exponential growth, I agree, we'll get there. But you're bringing up a great point. There's a reason why we don't let people drive until they're 16, right? Their brains have to mature. They learn from surroundings, they observe.

Ramine Roane

>> That's right. They experiment with the laws of physics for 16 years and now we trust them, right?

Dave Vellante

>> That's why full self-driving is maybe taking longer than a lot of people expected. I mean, you're a fantastic guest. Again, thank you-

Ramine Roane

>> Thank you so much.

Dave Vellante

>> so much for coming on and being so candid and best of luck. I hope we can have you back on in the future.

Ramine Roane

>> Thank you very much. It was a pleasure.

Dave Vellante

>> All right, and thank you for watching. This is Dave Vellante for John Furrier and theCUBE and NYSE Wired's Media Week, AI and cyber innovators. We'll be right back right after this short break. You're watching theCUBE.