John Cazes, director of high-performance computing at the Texas Advanced Computing Center, joins Tony Rea, global artificial intelligence infrastructure lead at Dell Technologies, to discuss advancements in supercomputing at SC25. This video explores the groundbreaking Horizon project, highlighting its significance for the open science community in the United States.
In this insightful conversation, Cazes shares expertise on the deployment of the Texas Advanced Computing Center's Horizon project. The video, hosted by theCUBE Research, addresses the inception and development of the project, offering insights into the strategies and technologies employed. The discussion extends to the critical roles of Dell Technologies and NVIDIA in supporting this significant endeavor.
Rea provides in-depth insights into key takeaways from the Horizon project, emphasizing its 10X increase in performance capability. He explains how this collaboration between the Texas Advanced Computing Center, Dell, and NVIDIA advances scientific research and productivity. The discussion also covers strategies for navigating and managing complex AI infrastructure, according to theCUBE analysts and guest speakers.
John Cazes, TACC & Tony Rea, Dell Technologies
Global AI Infrastructure Lead, Global Tech Office, Dell Technologies
In this SC25 interview, John Cazes, director of HPC at the Texas Advanced Computing Center (TACC), and Tony Rea, global AI infrastructure lead at Dell Technologies, join theCUBE's John Furrier and Dave Vellante to unpack the Horizon supercomputing system. Cazes explains how Horizon is rolling out with 4,000 GPUs, more than 4,700 compute nodes and an expected 10x increase in performance for open science workloads, making it the largest, most powerful system TACC has ever built. Rea adds perspective on the long-running TACC-Dell-NVIDIA collaboration.
Keep Exploring
What updates were shared regarding the Horizon project at the Texas Advanced Computing Center during Supercomputing 2025?
What are the details and expected impact of the Horizon system being rolled out by TACC?
What has changed in the levels of excitement or anticipation regarding the development and planning of the Horizon system over the past four years?
What is the nature of the collaboration between NVIDIA, TACC, and Dell?
What cooling methods are used for the different components in the new IR7000 ORV3 rack?
>> Welcome back. We're at theCUBE's live coverage of Supercomputing 2025. I'm John Furrier, host, with Dave Vellante. We have our whole team here, Jackie McGuire, Savannah Peterson, digging in and unpacking the future of AI infrastructure. We've been here for our fourth straight year. It's become the show for accelerated computing; Supercomputing is now delivering supercomputing at scale as high-performance computing transitions over. We have two great guests here, John Cazes, Director of HPC at the Texas...
John Cazes
>> It's TACC.
John Furrier
>> TACC.
John Cazes
>> TACC.
John Furrier
>> Texas Advanced Computing Center.
John Cazes
>> All right.
John Furrier
>> Can we just call it Texas Supercomputing Center? Welcome to theCUBE, and Tony Rea, global AI infrastructure lead with Dell Technologies. Fourth year in a row, you guys are back, TACC. You need to look at that name, TACC, Texas Supercomputing. You guys have got the Horizon project, it's all in the news. Let's get into it. What's the update? Because it's Supercomputing: brand new projects, off the rails.
John Cazes
>> So Horizon is rolling out beginning at the end of this year. We've already started delivery of hardware. We're going to end up with a system with 4,000 GPUs. We're going to have over 4,700 compute nodes. And the purpose of this system is to support science, the open science community in the US. And what this system will provide is an increase in performance and capability of about 10X. This will greatly enable what our researchers can do. And at TACC, we're very excited. This is the largest, most powerful system we've ever built.
John Furrier
>> Can you share the excitement levels? Go back four years, when we first interviewed here on theCUBE, our first year at Supercomputing, go back four years to today, just the level of... I won't say technical intoxication or excitement. What's the vibe like right now? Because you have the keys to the kingdom. You got the jewels.
John Cazes
>> So four years ago, we were very much in the planning stage. We had just delivered our key system that we run now, Frontera, and we were planning for this system, and we were making decisions. We were deciding who to partner with, what technologies to use, how strong on GPU, how strong on CPU. And we were looking at our current user base and our lead researchers and what applications they're using. All of this informed the choice of the Horizon system, which, some people may be surprised, is not all GPU. One of the reasons it's not all GPU is we still have a large number of scientific researchers who use CPUs. We are in the process of porting their codes to GPUs, because the GPUs are so much more power efficient and provide so much more performance than the CPUs.
John Furrier
>> I want to go to Dell because, first of all, yesterday in the briefing with NVIDIA, all the academic scientists were there, developers were there, it was a packed house. Dell was featured on the TACC Horizon project, and they really took a science angle to this, because now it's not just developers and computing professionals, it's also a major lift for the sciences side. Dell was featured in this as a big part of Dell's transformation with AI factories, which we've been covering like a blanket. What's the perspective on Dell? Has it been more kind of systems design? You've had the relationship with TACC, what's your angle on this?
Tony Rea
>> Well, this is like a triangle relationship, right? This is NVIDIA, it's TACC, and it's Dell working together very closely, cooperatively on a collaboration to advance science and help productivity of research. And that's, fundamentally, that's it at a high level, what we're doing. We worked very closely with TACC and NVIDIA on this design, and actually we have a very long history with TACC, supporting TACC. Right around the corner from us, there's a lot of stuff that goes on in between the two companies, and we also have a very deep relationship with NVIDIA, we collaborate with them also on specific projects.
John Furrier
>> So let's talk about the design considerations, because you don't just order an AI factory online. You got to build this thing-
Dave Vellante
>> Drop ship that factory.
John Furrier
>> So John, what were your specific requirements? What did Dell bring to the table? What did NVIDIA do? Were there other parties involved? What kind of services did you bring?
John Cazes
>> One of the things we got from Dell, and a lot of this was a product of a long working relationship, we've designed, I think, five fairly large systems with Dell. We needed a system that we could build dense, right? Because in HPC, nearness of computing matters because of latency. So we needed dense, and we needed something a little different than the standard NVIDIA design, the NVL72. Our racks are twice as dense as the standard NVL72 rack, 144 GPUs per rack. We also needed something water-cooled because of the amount of power, and an infrastructure to handle the power. These GPU racks are using over 200 kilowatts. So we needed something to support that. And then one of the things my admins mostly care about is we needed an infrastructure that they could manage, that they could manage remotely and easily. Since we are long-term Dell customers, we're very comfortable with their tools, and they take our feedback as they develop these tools. So this allows the admins to do many things directly and remotely to manage the hardware within the rack. A lot of these infrastructure concerns are why we chose to partner with Dell.
John Furrier
>> Okay. So you said 200 kilowatts per rack, 144 GPUs per rack and it's all liquid-cooled, it's hybrid?
John Cazes
>> Yes.
John Furrier
>> Air, liquid? Talk about the cooling in it.
John Cazes
>> The racks are almost completely liquid cooled. We have some top-of-rack switches that are air-cooled. The file system infrastructure is air-cooled, but the GPU and the CPU components are all liquid-cooled. And we have just completed the piping and the floor for the data center they're going to go in.
John Furrier
>> Okay, so you built-
Tony Rea
>> This is our new IR7000 ORV3 rack.
John Cazes
>> Yeah, I don't know the model number.
Tony Rea
>> Yeah, that's okay. I'll help you out there.
Dave Vellante
>> Just make it work.
John Furrier
>> Dell's got many models. You got great name-
Tony Rea
>> You got a lot of models. So this is a fantastic design. Very, very dense computing, with busbars per power instead of a lot of power cables. It simplifies managing all those power cables and PDUs. It's a fantastic system. And I've got to say, designing these things is really difficult. We worked very, very closely with NVIDIA on this, but the whole thing about these systems is reliability. We're one of the few vendors that can actually ship systems of this density that are actually reliable and will keep running after day one.
Dave Vellante
>> So you said 10X, the capabilities, are you talking performance, PetaFLOPS? Can you quantify that?
John Cazes
>> Not just petaFLOPS, but performance. We have 11 applications that we've chosen to accept this system with. The way this works is we choose applications from the scientific fields we cover, and we partner with the research team. So unlike regular benchmarks, which are pretty static, you just run them over and over and over, these are science projects, and we work with the scientists to improve their applications, port them to the new hardware, and we make improvements to their algorithms. And we've gotten some pretty good improvements in performance. And so the idea, the increase in performance, isn't just...
Tony Rea
>> Just warming up.
John Cazes
>> "I ran the same application, it was faster." It's, "I got more science done, I added some physics to this project. I improved many things."
Dave Vellante
>> You bristled at petaFLOPS. So you're talking about outcomes that you can achieve.
John Cazes
>> That's what we're measuring.
Dave Vellante
>> Oh.
John Furrier
>> I was looking at the news release, and the three areas were climate modeling, national security, and what was the other one? Scientific genomics. Are those the areas of the 11 applications you mentioned? Or are they just the highest performing-
John Cazes
>> No, ours is a little different. We've got molecular dynamics, some of it focused on the coronavirus. We have earthquake seismic codes. We have turbulence modeling. We have astrophysics codes, star formation, galaxy evolution. We have some electronic structure codes, solid state codes. Our choices are based on basically the research that the National Science Foundation funds, and the large projects that are currently running on our systems, and the community projects also that they follow.
John Furrier
>> So John-
Tony Rea
>> I'd just like to make a point about what he's saying. It's a very important thing they're doing here at TACC. Before, brute-force computing was about things like LINPACK, right? But what they're doing is actually changing the paradigm of how you design the architecture: instead of designing it to run LINPACK, now they're designing it to run these 11 modes of research.
John Furrier
>> Yeah, and NVIDIA has been pioneering the density side of it, making it more dense. Actually, they got their design to take advantage of density for performance. It's funny, I was talking to Dave this morning when we were walking in. This show started when I graduated college in 1988, and this was an HPC show, high-performance computing workstations. If Boeing wanted to design a wing, okay, they did; it took a couple of days, the numbers come back, and then, "Wait a minute, we're going to make a change." What we saw a couple of years ago, I think Dell, you guys had a presentation three years ago, when we started getting better with HPC. Now with full AI in there, they're making changes dynamically in real time to all the calculations. I mean, this is like a dream scenario. So you've got this AI-HPC convergence. Could you guys explain what that means? Because HPC used to be a category and now AI... Or is it the same thing?
John Cazes
>> So now we talk about AI and HPC as if they're separate things. They're really not. And in five years it'll just be the data center. We might say AI computing, we might say HPC computing, but it's going to be the same thing. The problems we solve, our researchers solve, the resources we provide, the equipment we have, it's the same resources that machine learning and AI use. Our researchers are gradually bringing machine learning and AI tools into their workflows. Some are completely replacing their workflows with ML. Most are supplementing or complementing their workflows. And right now this is a big thing, it's a paradigm change, but I guarantee you in five to 10 years this is just commonplace. The ML tools you use every day, you're not going to stop and think, "Oh, I used AI."
John Furrier
>> Yeah, you're still an advanced computing center in Texas.
John Cazes
>> Yeah, it's just going to be there. This is just going to be the way you operate. And that's the future we're headed for. And this is a system ideally set up to support that future and bring our current researchers who are still on the CPUs into the future with the GPUs. And then also, not all of the AI is done on the GPUs. There are lots of inference problems that work just fine on the CPUs.
Dave Vellante
>> You've got a lot of CPUs in this deal.
John Cazes
>> Yeah.
Dave Vellante
>> It had like a million, it said, in the press release.
Tony Rea
>> One of the things that's interesting about this system is that the 4,000 GPUs are now available to researchers. Researchers have a lot of trouble getting access to that number of GPUs, which limits the amount of research done. The super large systems with tens of thousands of GPUs are not accessible to researchers, right? What this system does is allow them to use GPUs at scale to advance their research. So it's going to open up new frontiers for researchers.
John Furrier
>> That's a great point, Tony, because NVIDIA is also saying, with their software, if you look at their libraries and the models they're supporting, it's horizontal, okay? So it's not a research-only thing, it's helping everybody, because they get the benefits. That's what I was saying earlier, comparing back to the wing design, but the progress that's been made to unleash more creativity and research, and science...
Dave Vellante
>> Right. That's what it's about.
John Furrier
>> How many researchers are we talking about that will have access?
John Cazes
>> So for this system, we will probably end up with a few hundred projects. In HPC, we usually have what you call a capability system, which is powerful, tightly connected, and large, in contrast to a capacity system. Frontera, our current system, and Horizon are capability systems. We'll have probably maybe 40 projects that are really large, then another 40 or 50 that are sort of mid-size, and then we'll have some smaller-sized projects. And these help out a lot because they keep the system busy. Then we have other systems; our other system, Stampede III, is a capacity system. Those are for researchers that need a smaller amount of resources. But on Horizon, there will probably be 40 projects that will run at the scale of a quarter to half the size of the system, and at special times the full size of the system.
Dave Vellante
>> So for this capability system, the mix of CPUs to GPUs, a lot of CPUs, is that to help bridge the researchers?
John Cazes
>> It is to bridge the researchers. Of the 11 projects I mentioned, eight are currently running using GPUs. I have three that are still CPU only. Of the three, two are actively porting to GPU now. But they get great performance on the CPU alone. The ARM chip, the Graces that we're running on, and we don't have Vera yet, is a great chip. We were really worried about this transition to ARM. We've never run an ARM system. ARM is the chip in my phone, it's not the chip in my supercomputer. But I have to say, they're great.
Dave Vellante
>> Yeah, we've seen ARM in the enterprise do amazing things. Okay, and then I want to ask you: a few months ago we saw the Intel-NVIDIA deal, CUDA on x86. It's not here yet, but is that another bridge? Is that going to facilitate the transition to the accelerated computing era? Or is it just a press release?
John Cazes
>> I think yes. I mean it won't be part of this system, but I'm sure we'll end up with this technology-
Dave Vellante
>> I think it's going to take a couple of years to-
John Cazes
>> This technology too.
Dave Vellante
>> Yeah.
John Cazes
>> And that's a bit of a different model, in that that's a more traditional GPU model with a server and a lot of GPUs, right? Our nodes have fewer GPUs, but also, what's important for HPC is our GPUs are tightly connected to the CPUs. And this is where HPC and machine learning AI diverge just a little bit. For HPC, there are a lot of codes that have a lot of synchronization, and they need low-latency access to data in memory and networking. And this is where the tightly-bound GPU and CPU helps out a lot, okay? Because they can share a common memory space and talk directly, and you don't actively have to move data back and forth. In this other model, where you have the GPUs on a PCI bus, the overall peak performance is there, but this tightness of connection sort of disappears. The latency goes up, and it makes it a bit harder to write HPC applications that are optimized for that platform. This newer, tightly-coupled platform works better for us, and it's easier to program for.
Tony Rea
>> But with that trade-off, that's actually a benefit, because customers who don't have the level of power they have at the TACC data center can use these PCI cards to build lower-power machines that go into air-cooled data centers.
Dave Vellante
>> So it gives you that flexibility.
John Furrier
>> I want to talk about access and democratization. We've been seeing this pattern both in robotics hardware and in supercomputing, large-scale advanced computing: the hardware nerds and software nerds working on some pretty cool physical build-outs, whether it's data centers at scale or, say, robotics, and that is the idea of collaboration with open source and open science. What you're doing here is opening up a kind of open-science model, where the hardware can react to the software, kind of like the Matrix: upload my skills, AI's coming from robotics. What does that mean for the large-scale side? In robotics, we're seeing open source and open academic communities collaborate to take advantage of the robots. So the question I want to ask you guys: how do you see open science and supercomputing? Is there sharing? How do people get involved? What's the vision? Or is this still evolving?
Dave Vellante
>> That's a big question.
John Furrier
>> Well, I've seen a pattern.
John Furrier
>> You've got robotics. Well, it's growing the whole market.
John Cazes
>> So one of our issues, I guess is getting the message out there that our systems are open to researchers all over the country, that basically if you work at an academic institution or a non-profit research institution, you can request access to our resources, the large scale and the small scale. And it goes through a peer review and evaluation, but they're available. And once you get access, it's like asking NSF for money or DOE or DOD for a research grant. You're basically writing a research grant for computing time, but it is available. So for anybody doing research, and it's also available for people partnering with commercial companies. Now we need an academic entity for the request, but you're talking about robot development. It can be academic development or commercial development, but they can get access to these resources to-
John Furrier
>> What's the process?
Tony Rea
>> I just have a point about what he's saying. They're going to be able to offer a high-powered system for fundamental research, to develop all sorts of interesting things. Where it's going to benefit is at the edge. A lot of the research here is going to enable more sophisticated robots and other edge devices.
John Furrier
>> They'll be hyper converged systems.
Tony Rea
>> Right, right. And so they're going to be an enabler for those technologies. And with this 10X increase in compute power, it's going to-
John Furrier
>> That's a great point.
Tony Rea
>> It's going to allow more advanced research to happen.
John Furrier
>> That's our topic at Mobile World Congress coming up next. So also to call that out, because Dave knows that's my pet peeve.
Dave Vellante
>> Oh, yeah.
John Furrier
>> On the process, just to get a plug-in for people watching, how is the process? Everyone knows writing a grant could be tedious.
John Cazes
>> Okay, so-
Tony Rea
>> I use ChatGPT for grants.
John Furrier
>> So does everybody else here.
Dave Vellante
>> Is it a slog, is it easy? You got to know someone?
John Cazes
>> Easiest way-
Dave Vellante
>> Is there a side door?
John Cazes
>> Google "TACC Allocations" and Google will... I don't give people links anymore, you just Google "TACC Allocations." They'll go to a page that describes how to request a resource grant, and it'll let you request different levels. For the highest level grant, the largest allocation, you'll write a 10-page proposal, and it will describe your science and how you want to use the system.
John Furrier
>> So you guys, you know, run the vetting process.
John Cazes
>> Yeah, because we have to be responsible stewards of the government's money, right? And so the people we're letting on usually have experience. Some people have run at TACC for 20 years.
John Furrier
>> Yeah. And if they mention theCUBE, do they get a discount or cut the line? If you mention theCUBE, theCUBE sent you, John and Dave sent us.
Tony Rea
>> You get a discount on computer time.
John Furrier
>> Well, I really appreciate what you're doing. I love the open science. I love what you guys are doing, your mission. I think this is going to get better and faster. I think there'll be more adoption and then hopefully other systems can talk to each other across institutions. So thanks for doing what you're doing.
John Cazes
>> Thank you very much.
Tony Rea
>> Thank you.
Dave Vellante
>> Thank you guys.
John Furrier
>> I'm John Furrier with Dave Vellante. Advanced computing centers or supercomputing centers, it's just more computing, and it's going to be distributed across core, cloud, and edge. And again, the data's got to move around. You're going to have all kinds of new things being enabled. The best is discovery and invention. That's what we love. Thanks for watching.