SC24 is a significant event for HPC and AI infrastructure. Dell Technologies addresses challenges in latency, network bottlenecks, and expertise needed for AI networking. Dell helps organizations utilize low latency, high bandwidth network designs with technologies like InfiniBand and RDMA. They focus on improving outcomes and business value from gen AI by reducing bottlenecks, increasing throughput, and maximizing GPU utilization for model training. Dell collaborates with industry leaders like NVIDIA for advanced AI networking solutions. They assess customer...
Keep Exploring
What are some of the challenges organizations face when architecting networks for AI applications and how is Dell Services helping them overcome them?
What capabilities and services are being brought by NVIDIA and its partners to help cloud service providers with networking challenges?
What strategies and resources is Dell utilizing to support customers who are not experts in supercomputing or HPC but are looking for AI networking solutions?
What are some key aspects of AI architectures that businesses need to consider, in terms of networking, compute, and storage integration, as well as software stacks and infrastructure models?
>> Hello and welcome to theCUBE's coverage of Supercomputing 24, SC24 as it's most commonly known. We are live from the floor in Atlanta today and have even more coverage coming from theCUBE's studios. SC24 is not just the premier Supercomputing show for HPC, or high performance computing, but also for AI infrastructure nowadays. And with that, let's dive into some of the discussion about how data center networks need to evolve to handle these new workloads. Right now I'm joined by Scott Bils, VP of professional services at Dell Technologies. Hey Scott, welcome on board. And this is a topic that's near and dear to my heart, to put it mildly. And I think, again, when you think about it, data is the lifeblood of AI.
Scott Bils
>> Yeah. No, totally agree. And the key to driving outcomes and business value from gen AI is data. And that's where the role of AI networking becomes so important and critical. When you think about AI networking and the role it plays in data, when you think about clusters and AI architectures, they're really fundamentally different than traditional data center networking. When you think about clusters of GPUs, you essentially want the clusters at a rack level, or even a data center level, to function as a single computer, a single brain. So what that->> Sorry. Yeah. I think, again, that's getting into that, into the challenges. And what are some of the primary challenges organizations face with that latency, the network bottlenecks, and again, not even that but just the in-house expertise for the complicated technologies that are there?
Scott Bils
>> Well, it's really architecting the network to reduce latency and improve throughput. You really want your GPUs to be fed data as quickly as possible. You want them to be fully utilized. And with the architectures of AI networks being fundamentally different, meaning you connect GPUs to GPUs, to drive them at scale you really need a fundamentally different architecture, which requires a set of skills a lot of organizations don't have today, whether that be InfiniBand or RDMA. A lot of customers are needing help in this area, and that's where Dell Services is engaged to help them overcome the challenges and the issues they're facing.>> I think that's key. And I think it's great that you guys have that expertise. Again, I'm probably one of the few analysts who's actually configured InfiniBand in his life. And I can tell you Ethernet's a lot easier. But still, I think when you talk about RDMA and these new technologies, getting the most out of them can be a challenge. So how can organizations really leverage low latency, high bandwidth network designs and technologies like InfiniBand and RDMA to build scalable, future-ready AI networks? And how is Dell helping with that?
Scott Bils
>> Yeah. We're helping by bringing our capabilities, our new services, and the expertise of our partners as well. For example, NVIDIA, to help them design, implement, and optimize these networks at scale. What we find today is a lot of the cloud service providers, GPUs of service providers are challenged with this. We're helping quite a number of them today with these networking challenges. But as enterprise deployments begin to scale out, they're going to face and are facing similar issues. So helping them think through the overall design architecture not just for today but going forward as they scale out the environment is a big part of the capability we bring. And then the expertise from NVIDIA and our other partners in the space as well.>> Yeah. Because again, I think what I love about Dell's strategy around networking for AI is the fact that you're bringing together a number of different pieces. That gives choice to that end customer. But one of the things that we talk about a lot with AI is what business benefits can really be achieved. And what is the ROI or improved efficiency or faster AI operations by really adopting these tailored, expert-driven AI networking solutions? Because the ROI is what everybody's looking for. And they're saying, "Oh my god, I've got to build out a new network," and things like that. What are some of the things that you see that help them on the business side?
Scott Bils
>> Well, it's really time-to-value in feeding the GPUs, the models with data and getting the most out of those GPUs from a capacity and a utilization standpoint. That's really where we see a lot of the ROI, is removing those performance bottlenecks, increasing the throughput, and allowing you to drive full utilization, which is reflected in model training, model inferencing, the outcomes that really matter to the business. So it's a critical enabler of ROI, getting the most from your infrastructure and making sure you're driving the most from a use case enablement standpoint.>> Yeah. I think that to me is the key, is understanding the use case and making the right network for the right use case. But I think what's nice is, and you kind of hit on this a little bit, but is that you're not doing it alone. Dell's not going alone. Dell's expertise and partnerships with industry leaders like NVIDIA, and integration of technologies like RDMA and InfiniBand, make it a leader in delivering these superior AI networking solutions. Really, how do you see all of this playing together as people try to look at... because they're not experts in supercomputing. They're not experts, necessarily, in HPC. Some are, but they're looking for the easy button. And I think, again, you're bringing a lot of partners to the table as well.
Scott Bils
>> We're bringing partners to the table. We're also bringing a full scope approach to it so it's not just about architecting the network, deploying the gear. It's also around the data fabric as well, and SONiC, and implementing and optimizing that. Helping to monitor as well and bringing all that to bear for our customers. And the skills piece is something, yes, we recognize absolutely. There's a gap. That's where we bring, obviously, our skills to bear. But also our learning and education services and training that we do to help upskill our customers around the required technical skills, expertise, certifications required to operate and run these complex AI networks on an ongoing basis.>> Let's double click on that a little bit, because I think what's good is, like you said, the certifications. And it's not just about always doing it for them. It's about getting them upskilled, in many cases. Because again, even SONiC, which is, I believe, your stuff that's around RDMA and how you do it over Ethernet. But again, they may not have these skills. And InfiniBand as well. How are you really approaching that upskilling as well? And how do you engage with customers from that perspective?
Scott Bils
>> Yeah. I think it's part of our broader professional services philosophy and mandate towards working with customers. We certainly want to help them drive outcomes as rapidly as possible, address their issues. But look, we don't want to be hanging around from a consulting standpoint. We want these customers to be able to support and manage themselves. So a big part of our engagements is the skills transfer. Upskilling them. Making sure they're in a position to support and drive on an ongoing basis. So when you think about how we engage customers philosophically, not just networking but overall, that's a big component of what we bring to bear, is helping them upskill for day one, day two, and beyond in terms of management and the ongoing extension and expansion of the network.>> So when you look at it and you're engaging with customers and they're thinking, hey, we're trying to really look at our architecture, where do you help engage with them first on this journey? Because I think that's important. How do you get started? Because it can seem overwhelming to organizations.
Scott Bils
>> Yeah. A lot of customers we work with, the starting point is just doing an initial assessment on the environment and overall readiness. We talked about the fact that data center architectures, networking architectures are fundamentally different in the gen AI world versus traditional data center. A lot of the ways we get started is to help do a quick assessment to understand not just from a networking standpoint, but also power, cooling, data center overall. How ready are they? What are the key issues that they need to address from an architecture and design standpoint? And how do they need to think about the priorities that they need to go address from a services and support standpoint? From an engagement standpoint. So the first step is really to go and do that quick assessment to understand, hey, how ready are you really for AI infrastructure, and networks in particular?>> Yeah. And I think that, to me, is really one of the big keys, is understanding. Is part of that a skills assessment of the organization, and helping them understand their gaps and where they play as well?
Scott Bils
>> Absolutely. It's not just the technology, it's not just the overall deployment architecture and how networking fits in, but it's assessing the skills and understanding where the customer wants to be longer term from a skills standpoint. We mentioned that a lot of them need to upskill and train their teams around these skills. There are others that say, "Look, we would prefer to have you go ahead and manage this going forward." So the skills assessment is certainly a piece of that, and that can lead in a couple of different directions depending on the customer's strategy and how they're thinking about ongoing day two operations and management.>> Yeah, I think that's the key. And I think you hit on it, is not only, how do you build the architecture? Not only how you implement it. But day one, day two. Talk a little bit about how you would get people ready for that day two and help them really from the architecture and transition to owning it.
Scott Bils
>> Yeah. Well, AI networking is a piece of it, but another aspect of AI architectures is that networking, compute, and storage are more tightly and more integrally linked than they have been in the past. You may have seen recent announcements that we've made around rack level integration of those components. Delivering full racks to the customer. They need to think about operations in the context of a different infrastructure model, but then also different software stacks on that as well. So again, AI networking is a component of thinking about that broader AI infrastructure and stack that they need to consider, but it's taking that holistic view and determining what are the key issues. What's different now versus what we've done with our traditional data center?>> Yeah. No, I was lucky enough to be in Austin and getting to walk through one of your AI data centers, and got to not only... I'm glad I had earplugs because it was loud when the GPUs turned on. But what was really cool from a networking perspective was really the rack design and the thought that went into the cabling aspects of it as well. And I think that goes into your overall approach to, as you guys call it, the AI Factory, as it would be.
Scott Bils
>> Yeah.>> And does that help simplify things for a lot of organizations as they get to those day two operations?
Scott Bils
>> Yeah. Because look, they're looking for the easy button for on-prem AI, bringing AI to their data. The AI Factory, the way we configure that and bring that to customers, we have different T-shirt sizes, different bundles we bring to customers to help them get started. But it integrates a lot of the networking components in it already in terms of the hardware configuration, but then also our assessment and design services as well. So yeah, the whole AI Factory construct makes it much easier for customers. It's based on this principle that we want to bring together an integrated package to our customers, including infrastructure and ecosystem partners in our services, and make it easy to get up and running and get their AI use cases underway.>> Yeah, I think everybody's looking for the easy button. But hey, let's hit on the last word here. I know you have some more places people can go to really dive in on this topic in particular. Again, I love the fact that you're always trying to help upskill and help people understand, because I think there is a lot of education that's needed, especially on the networking side of this space.
Scott Bils
>> Yeah. Certainly you'll see the announcements around our new AI networking services here. Would also point everyone to the blogs we'll be posting by Matt Liebowitz, who is driving our AI networking services on our consulting portfolio team, talking through these new service capabilities in more detail. It's a great piece that gets into more detail around the what, where, and how. And I would encourage everyone to go and take a look.>> That's great. And I think, again, we'll put that down in the description section as well, a link to that blog. Well, thanks for coming on, Scott. I know we'll talk about more stuff over the time here. And SC is just so exciting. So thanks for coming on.
Scott Bils
>> Yeah, thanks for having me. Enjoyed the conversation.>> And thank you for watching this segment live from the floor of SC24, and theCUBE's Studios, where I am. And stay tuned for more SC24 on theCUBE, the leader in tech news and analysis.