At KubeCon + CloudNativeCon North America in Atlanta, theCUBE’s Rob Strechay sits down with Red Hat’s Stu Miniman, senior director of market insights, and Robert Shaw, director of engineering, for a ground-level look at how Kubernetes is becoming the production backbone for AI. Shaw explains why most large language model (LLM) deployments are landing on Kubernetes and unpacks the latest on vLLM and LLMD – two projects hardening inference at both node and cluster scale. He details how vLLM maps the fast-moving open-weight model ecosystem (e.g., Llama series and new entrants like DeepSeek) to diverse accelerators from NVIDIA, AMD, Google TPU, Intel and AWS, while LLMD targets cluster-level optimizations such as load balancing heterogeneous workloads and specializing pre-fill vs. decode phases to boost tokens-per-node.
Miniman connects these innovations to what Red Hat is showcasing at the event: trust and security in the AI era (with projects like SPIFFE/SPIRE and KServe), plus hands-on learning at OpenShift Commons. He highlights community stories from industries such as financial services and public sector (with names like Ford, Morgan Stanley and Northrop Grumman) and underscores how platform engineering, observability and hybrid/edge architectures are evolving to meet demanding AI inference patterns. The conversation also touches on cost and performance economics, why hybrid remains foundational for AI training and inferencing, and how Kubernetes, GitOps and CNCF projects are coalescing to scale real-world AI use cases beyond simple chatbots into agentic applications.
Stu Miniman & Robert Shaw, Red Hat
Senior Director of Market Insights, Hybrid Platforms, Red Hat
Rob Strechay
Dir./Principal Analyst & Host, theCUBE Research
>> Hello, and welcome to this CUBE Conversation. I'm Rob Strechay, and we're here gearing up for KubeCon + CloudNativeCon North America in Atlanta, the epicenter of what's next in cloud-native innovation. We're seeing AI and Kubernetes come together in ways that are reshaping the entire stack and how organizations leverage platform engineering techniques, gain developer productivity, maintain visibility through observability, get security to zero trust, and leverage edge and hybrid cloud to unlock entirely new development models. To help me unpack the goings-on and what we're going to see in Atlanta and more and beyond, I'd like to welcome Rob Shaw, who's the director of engineering at Red Hat, welcome on board, and somebody you've probably never seen on theCUBE before, Stu Miniman, who's the senior director of market insights at Red Hat. Welcome on board, Stu.
Stu Miniman
>> Thanks, Rob. Great to be here.
Rob Strechay
>> Glad to have you both here. I think that when you start to look at this, I want to take a step back for a second and look at cloud-native as an entirety, because I think the fact that Red Hat has been making a lot of investments, not surprisingly, in this space, and really bringing together AI and Kubernetes. And seeing we have Rob Shaw, who was with Neural Magic, here, why don't we start out with you and you help us understand the landscape from a cloud-native perspective and how these two things are really coming together with AI and Kubernetes?
Robert Shaw
>> Yeah. Well, as I think about it, we've had basically three years now since the launch of ChatGPT, and so there's been a lot of excitement around LLMs and lots of pilots over the course of the past three years, as various different use cases have started to pop up across businesses and consumer applications. And so, where Kubernetes comes into the picture is we're starting to turn a lot of these pilots into production applications, and from what I've seen, almost all of the deployments of LLMs are coming on top of Kubernetes. And so, it really sets the picture for what Kubernetes is best at, these long-lived services, these production-quality applications, all the reliability, scalability, et cetera, that has been built up for running other applications, LLMs can really fit on top of that overall paradigm. And so, we're starting to see lots of different enterprises, startups, et cetera, leveraging Kubernetes to deploy these applications. And so, that's where a lot of the excitement has really been, especially as we've started to see reasoning models and agentic applications, which are very, very, very demanding inference workloads. That's what we've spent our time on at Neural Magic, and now at Red Hat, is hardening that stack, making it as efficient as possible, and starting to look at places where we can tie together inference engines, like vLLM, with the Kubernetes ecosystem.
Rob Strechay
>> So let's explore that a little bit. As a maintainer for vLLM and LLMD, why don't you give us an update on where those things stand?
Robert Shaw
>> Yeah, yeah, sure. So vLLM is a piece of software that's really grown up with the open source LLM ecosystem. As I think back over the course of the past three years, we've really seen a proliferation of open-weight models starting to emerge. The Llama series from Meta is a great example of this, going from Llama 1 to Llama 2 to Llama 3 to Llama 4. Recently, we've seen lots of new entrants into this space, such as DeepSeek. These models need a way to run on hardware accelerators, and so that's what vLLM is all about. It's about mapping that whole ecosystem of open source models onto that whole set of hardware accelerators. So we've collaborated very closely with the NVIDIA team to make these models run really fast on NVIDIA GPUs, with the AMD team to make those models run really fast on top of Instinct, obviously with the TPU team from Google to map those models onto TPUs, and with the Intel and AWS teams as well, to really allow that whole ecosystem of models to run on top of that whole ecosystem of accelerators. And so, that's really where vLLM sits. It's that single ecosystem project where all these different model providers can map onto all the different hardware accelerators, and it really provides that integration point for all of these key ecosystem participants.
But what we really identified about a year ago, as these models got larger and larger, is that there's an opportunity to look at cluster scale, in addition to an individual pod. vLLM maps to a pod, basically, where it's going to be deployed on a single node and it's going to try to make that node give you as many tokens as possible for serving that model. One of the things we identified is there's lots of distributed optimizations, and for LLMs, the first letter is L. They're very large, they're very, very, very compute-intensive, so we launched a project called LLMD, which is really a collaboration between the Gateway API Inference Extension in the upstream Kubernetes ecosystem and vLLM to try to look at distributed optimizations that allow us to get more tokens out of an entire cluster of vLLM servers.
Rob Strechay
>> And it would seem that that would tie in very nicely. Again, I went through what the main pillars were for KubeCon, and we'll get to the stuff, because I know there's a lot of exciting customers that are going to be on at Commons and other places that we'll dive into in a second. But when you look at LLMD and stuff like that, one of the big things is edge and hybrid. To me, it would seem like LLMD is almost a bridge for that as well, because if you're going to do inference, a lot of people are talking about how they want SLMs at the edge and things of that nature. Is that where LLMD really fits in?
Robert Shaw
>> Yeah. As I think about LLMD, it's really about a variety of different performance optimizations. I think a really critical part of LLM serving is that these models are running on really, really expensive hardware accelerators. Even small language models are eight to 10 billion parameters, and they're very, very, very expensive to generate tokens with. And as we start looking at more modern LLM workloads, these can be reasoning workloads, where the model is going off and generating tokens for an entire minute, thousands and thousands of tokens per inference request, or it could be agentic applications, where the model is executing a tool and then the result of that is getting sent back to the model. These are very, very, very, very compute-intensive workloads, so LLMD really takes the idea of looking at a whole cluster of vLLMs. And to really explain why there's an opportunity to optimize at the cluster level, I think it's important to really compare and contrast the LLM inference workload with a standard HTTP request. The average HTTP request that Kubernetes is amazing at running is really, really fast, it's uniform, it's super cheap, it's going to take less than a second, and every single replica can process the request about equally well. And so, standard things, like a load balancer and a scale-out deployment, are a really, really good strategy for scaling that deployment, because with round-robin, you can just distribute the workload, and in general, you get an even amount of load. If we look at LLM inference, all of those assumptions basically don't apply. For instance, the average inference request can take 60 seconds, two minutes. You have a huge heterogeneity of the inference workload that you're running. You could have a RAG workload, where it's thousands of input prompts and really, really short generation. You could have a reasoning workload, where you get a short prompt, then you generate for a really, really, really long time.
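The contrast between uniform HTTP traffic and heterogeneous inference traffic can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not how LLMD or any real gateway routes requests: the function names, the request costs (0.1 s, 0.5 s, 60 s) and the repeating arrival pattern are all hypothetical, and the model only adds up the seconds of work each replica is handed rather than simulating time. The arrival pattern is deliberately a worst case for round-robin with four replicas.

```python
def assign(costs, n_replicas, policy):
    """Return the total seconds of work each replica ends up with."""
    load = [0.0] * n_replicas
    for i, cost in enumerate(costs):
        if policy == "round_robin":
            # Ignore how busy each replica is; just take turns.
            target = i % n_replicas
        elif policy == "least_loaded":
            # Send the request to the replica with the least work so far.
            target = min(range(n_replicas), key=lambda r: load[r])
        else:
            raise ValueError(f"unknown policy: {policy}")
        load[target] += cost
    return load


def imbalance(load):
    """Busiest replica relative to the mean; 1.0 means perfectly even."""
    return max(load) / (sum(load) / len(load))


# Uniform, cheap requests (the classic web case).
web = [0.1] * 400

# Hypothetical LLM mix: one 60-second reasoning request followed by
# three short requests, repeated. With 4 replicas, round-robin sends
# every long request to the same replica.
llm = [60.0, 0.5, 0.5, 0.5] * 100

web_rr = imbalance(assign(web, 4, "round_robin"))
llm_rr = imbalance(assign(llm, 4, "round_robin"))
llm_ll = imbalance(assign(llm, 4, "least_loaded"))

print(f"web, round-robin:  {web_rr:.2f}")
print(f"LLM, round-robin:  {llm_rr:.2f}")
print(f"LLM, least-loaded: {llm_ll:.2f}")
```

Round-robin stays even on the uniform workload but piles all the long requests onto one replica in the mixed workload, while a load-aware policy keeps the replicas close to the mean. A real scheduler would need live signals such as queue depth or cache utilization instead of known request costs.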
It's very common to have multi-turn conversations, where the requests... If you think about a chatbot use case, you'll go back and forth with the model. The request patterns tend to have this temporal locality, where there's opportunities to exploit these properties. And LLMs as well have a very specific generation phase, where there's two phases to inference. There's the pre-fill, where you're processing the prompt, and there's the decode, where you're generating the tokens one by one. These have actually very, very, very different properties, where the pre-fill is what's called compute-bound. It's bound by basically how many FLOPS you have in the machines, whereas the decode phase is what's called memory bandwidth-bound, it's bound basically by how fast the HBM on the GPUs is. And so, there's a really cool optimization, which is that you can specialize pre-fill replicas and decode replicas, connect these through high-performance interconnects, and really enable you to specialize the vLLM pods for these types of work.
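A back-of-the-envelope roofline check shows why the two phases behave so differently. This is a rough sketch, assuming a dense model where one forward pass costs about 2 FLOPs per parameter per token and reads every weight from memory once; it ignores attention and KV-cache traffic. The accelerator figures (1,000 TFLOP/s of compute, 3 TB/s of memory bandwidth) are hypothetical round numbers, not the specification of any real device.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight is
    read once per pass, so the parameter count cancels out.
    """
    return 2 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Classify a forward pass against the accelerator's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the chip can sustain
    if arithmetic_intensity(tokens_per_pass, bytes_per_param) > ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s and 3e12 bytes/s.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Pre-fill processes the whole prompt in one pass (here 2,048 tokens).
prefill = bottleneck(2048, PEAK_FLOPS, MEM_BANDWIDTH)

# Decode produces one token per pass for a single sequence.
decode = bottleneck(1, PEAK_FLOPS, MEM_BANDWIDTH)

print("pre-fill:", prefill)
print("decode:  ", decode)
```

Under these assumptions the pre-fill pass does thousands of FLOPs per byte of weights read and saturates compute, while a single-sequence decode step does about one FLOP per byte and waits on memory. Batching many decode requests raises the intensity, which is one reason separate replica pools for the two phases can be tuned differently.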
And so, LLMD looks at all of these general ideas. How can we better load balance a heterogeneous workload? How can we take advantage of the fact that there's multi-turn request patterns? How can we take advantage of the fact that the LLM workload has this pre-fill and decode phase, and maybe we want to actually specialize different replicas for different amounts of work? So all of these ideas are things that we know are being used by frontier model providers to squeeze more tokens per node, basically, out of their instances. And our goal with LLMD is to take all of these different performance optimizations, compose them into Kubernetes to make it easy for enterprises and startups to be able to take advantage of those same optimizations when they're going to deploy in the familiar environment of Kubernetes that they're used to.
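The multi-turn idea can be sketched as a router that prefers the replica which has already processed the start of a conversation, so that replica can reuse cached work. This is an illustrative toy, not LLMD's scheduler: the `PrefixAwareRouter` class, the block size of 4 tokens and the integer "tokens" are all hypothetical, and a real system would track actual KV-cache state and load on each vLLM pod.

```python
class PrefixAwareRouter:
    """Toy router: prefer the replica that has seen the longest prefix."""

    def __init__(self, n_replicas, block_size=4):
        self.block_size = block_size
        # Prefixes (as token tuples) each replica has already processed.
        self.seen = [set() for _ in range(n_replicas)]
        # Number of requests sent to each replica, used to break ties.
        self.assigned = [0] * n_replicas

    def _prefixes(self, tokens):
        """All prefixes of the request whose length is a multiple of block_size."""
        return [
            tuple(tokens[:end])
            for end in range(self.block_size, len(tokens) + 1, self.block_size)
        ]

    def _hits(self, replica, prefixes):
        """Count leading prefix blocks this replica has already seen."""
        hits = 0
        for prefix in prefixes:
            if prefix not in self.seen[replica]:
                break
            hits += 1
        return hits

    def route(self, tokens):
        prefixes = self._prefixes(tokens)
        # Most cached blocks wins; among equals, the least-used replica wins.
        best = max(
            range(len(self.seen)),
            key=lambda r: (self._hits(r, prefixes), -self.assigned[r]),
        )
        self.seen[best].update(prefixes)
        self.assigned[best] += 1
        return best


router = PrefixAwareRouter(n_replicas=3)

chat_a = list(range(1, 9))      # first turn of conversation A
chat_b = list(range(101, 109))  # first turn of conversation B

a1 = router.route(chat_a)
b1 = router.route(chat_b)
# Second turns resend the history plus new tokens.
a2 = router.route(chat_a + [9, 10, 11, 12])
b2 = router.route(chat_b + [109, 110, 111, 112])

print("conversation A:", a1, a2)
print("conversation B:", b1, b2)
```

Each conversation sticks to the replica that served its first turn, while new conversations spread across the remaining replicas. Plain round-robin would have sent the follow-up turns to replicas with no cached prefix.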
Rob Strechay
>> Yeah. I think that actually ties in well with... I'm already signed up for Red Hat Commons, but Stu, why don't you help us understand what Red Hat has going on at the show, and how it ties in? Because I know some of the customers already that are going to be on stage, there's a lot of AI going to be talked about.
Stu Miniman
>> Sure. Well, Rob, what Rob just talked about, we reach a certain point in this ecosystem where it's like, well, we've got a certain level of maturity, people said sometimes it's a little bit boring. AI's changing a lot of these things. It's a natural extension, but not only are there all the AI pieces, but underneath, there's a lot of changes that need to happen, because as Rob so well pointed out, the workloads are very different, how long they stay, how stateful they are and everything there. So first of all, at the main show on Tuesday morning, we've got a nice slot, we're talking about trust and security in the era of AI that we have.
Rob Strechay
>> Which is huge.
Stu Miniman
>> Absolutely, yeah. In general, people need that, but you know that that's what you're going to get from Red Hat. Luckily, it's an extension of a few open source projects that have been around for years, like SPIFFE and SPIRE, which are very focused on that space. I know you know that well, Rob.
Rob Strechay
>> I know.
Stu Miniman
>> As well as some of the AI things, like KServe, which will also be highlighted in that environment. You mentioned OpenShift Commons. It's part of the Day Zero event. It's not at the main event, it's a little bit further away, but it's actually near where most of us are staying in the hotel blocks and the like, and OpenShift Commons, of course, is a great place to hear from all the customers out there. So we've got financial services, we've got public sector, a variety of geographical backgrounds, and big names, like Ford, Morgan Stanley, Northrop Grumman, all sharing their environments, and there are opportunities, if you stay for the afternoon, to get hands-on. So we talked about AI, we talked about cloud-native, but there's actually a hands-on environment for looking at storage and virtualization, which has, of course, been one of the hotter conversations we've been having. So lots of areas to be able to learn. And of course, when you come to the show itself, Red Hat's going to be the first booth when you walk in, and if you go to the project area, there's a lot of Red Hatters supporting the projects. Rob's one of the key contributors for vLLM, and we've got Red Hatters throughout dozens and dozens of open source projects, which of course is the main thing we're talking about at KubeCon: all the open source projects and how they have benefited users and what they're doing.
Rob Strechay
>> Yeah. To your point, I think the whole community aspect of it, when we talk about... This is, what, our fourth one we're doing of these before KubeCon + CloudNativeCon, and when you start to talk about the hallway track and getting in with people and having these discussions, to your point, I think Commons, that was the great thing for me, being a geek at heart, was being able to talk to the end users actually using stuff, going and saying, "Hey, this is why I went down the vLLM path, because I have more than just one type of GPU in my environment and I need to be able to have some commonality and this is why I want to work off a common platform, which is why open source is so great."
Stu Miniman
>> Yeah. Well, Rob, one of the things, when you go to this show, it's like, okay, how much time do I spend hanging out in the hallway and wandering booths and talking to people? One of the things I love about Commons is most of the sessions, we leave time for Q&A. So you want to talk to all those users and get out and... Oh, by the way, if you can't make it, we're going to put them all up on YouTube. At the end of the main stage, we put a bunch of our folks from the Red Hat side up on stage to do AMA. I'll be up there. It's always a question that comes out, got our lead technical engineers and product managers, but sometimes it's something I can help with. But really good way to interact with people and it's a good start to get you as to, okay, where do I want to spend time at the show and dig deeper in some of these technologies? Because at the main show, a lot of times, if you want to speak to the speaker, it's like, okay, they finish, everybody queues up, do I actually get to see them and talk to them? So yeah, as you said, community is something we know pretty well at Red Hat, and getting to interact and actually dig a little bit deeper is a great opportunity.
Rob Strechay
>> Yeah, I would say so. And I think one of the things that's been exciting is that there has been so much coming together around the Linux Foundation in general and CNCF and some of the other parts of the Linux Foundation. Rob, from your perspective, how do you see all of this as you look at AI and open source? I'm very bullish on it. I have a funny feeling you're very bullish on where it's going. What do you see as the next 12 months with things like vLLM and LLMD and stuff like that? You threw another acronym out there, LLVM, which now is going to be stuck in my head forever.
Robert Shaw
>> Well, hopefully, we'll try to reduce some of the alphabet soup we have going on. But in general, one of the things that I'm most excited about is the quality of the models themselves continues to just get better and better and better. We've seen huge innovations over the course of the past three years. I saw a chart that showed ChatGPT when it first came out versus what it's like now, and it's just stunning the amount of progress that has happened over just a very, very short period of time. And we're seeing that continue through open source models, where the reasoning capabilities that have been added to some of the proprietary models are starting to show up in the open source models. We're seeing really, really large open source models be released that really have that frontier capability. So that's one thing I'm really excited about is just the continued improvement in the overall quality of those open source models, which is what actually unlocks the business use cases and consumer use cases that is powering all this frenzy. So I think that that's one really, really great trend. And then, I think the second thing is companies and consumers alike and startups alike have been searching for use cases over the course of the past two years. In earnest, I think a lot of those are going to start getting to scale. Companies are really starting to think through what does their inference platform look like. I've talked to a lot of customers, many of them are running on proprietary APIs. I'm starting to see lots of companies starting to, however, plan for the future, as these open source models get better, as the costs start to rise, and they're really starting to think through, what does my platform look like, what is the foundation of the next 20 years of my private LLM cloud, so to speak, that they'll be running. And so, that's really why I'm so excited about LLMD and vLLM. 
I really think that these two technologies are going to be really fundamental to what these platforms look like. vLLM is that Rosetta Stone that takes this huge ecosystem of models, which gets better week on week, takes that huge ecosystem of accelerators, which gets better and better, generation on generation, and maps those together, and LLMD is the thing that allows that to compose with Kubernetes. I think those are going to be the foundational projects that really drive that next layer of platform that's really being built as we're speaking now. So yeah, I couldn't be more excited about the future. I'm also really excited about the new accelerators that are entering into the market. It's been great working with the Google TPU team to bring support into vLLM. We've been working with AMD for a long time. We have some really new, exciting architectures, like the Blackwell NVL72 machines, which have some really cool networking associated with them. So there's certainly lots of new problems for us to work on in vLLM to make sure that we're taking advantage of the amazing innovations going on at the hardware level. But it's just really such a vibrant space, both at the model and the hardware layer, that just makes for really, really interesting technical problems in vLLM and LLMD, and I think it's just really interesting to see the beginning of these platforms starting to emerge inside of customers. And so, it's really those two things together: super exciting technical problems going on, and then really starting to work with customers who are building their open source model serving stack seriously. I just think it's going to be an amazing next few years.
Rob Strechay
>> Yeah. I think, again, it's one of those things that I'm excited, because I think what gets me most excited about getting there is, and you kind of hit on it, but I'm sure that Red Hat's got more than just Commons going on with a lot of the customers, because I know some of them will be on theCUBE later in the week, which is fantastic, we always love that, because it's where the rubber hits the road, or if they're in aviation, hits the runway, when you talk about it. I think part of it is that Red Hat is going to be going across all those pillars as well. There's security, there's the platform engineering stuff that you're doing, there's all of the different pieces within making observability more accessible with some of your partners and things like that. So what else you got going on?
Stu Miniman
>> Yeah. So Rob, there are so many things. As I said to you, we actually have a website that lists them. First of all, even at Day Zero there's the... They renamed Cloud Native SecurityCon to something that I can't quite remember. There's all the Backstage activity, of course, which is platform engineering. There's some of the AI pieces, where we will have additional Day Zero activity going on. And then, we have way more sessions that we are presenting at than we have time to go into here. So please, just go search Red Hat KubeCon 2025, and you will find a landing page which has everything, including a link back to theCUBE, of course, where we're going to have... I'm always thrilled, Rob.
Rob Strechay
>> And we'll put the link in the-
Stu Miniman
>> A couple of customers on there. What's interesting is that Rob really talked about how so many customers are still trying to sort out their use cases, and there are a number of customers I talk to who say, "Oh, I figured it out, but the economics didn't make sense." And the thing that we're working on is that underneath, Kubernetes is reliable, secure, scalable and the like, but some of the additional changes that we're making can really help those economics by getting better usage. We had one of our channel partners, I introduced him to vLLM, and he called me up the next week and was like, "Oh my God, I doubled the amount of LLMs that I could run on my small lab environment." And of course, that just totally changes the economics on whether that use case is viable and the like. So we're working through some of the early use cases, and as more customers are going out to much broader scale production, that's really where Kubernetes comes in, and there's a lot of work, not just in LLMD, but in all the Kubernetes pieces underneath. How do pipelines and GitOps play into this new world? There's dozens of projects there. So we know the CNCF has been great at pulling in a lot of these projects to make sure that they work there, because the infrastructure and the applications have to play well together in this fast-moving space.
Rob Strechay
>> Yeah. One of the ones you missed was Cloud Native + Kubernetes AI Day, which... How can we forget that? Rolls off the tongue. And Kubeflow, which is another project that's awesome as well. I think, again, when I look at this, the amount of people that are going to come together in this entire space down there in Hotlanta, it's going to be .
Stu Miniman
>> Rob, probably not snow this time, unlike the people that went to Salt Lake last year.
Rob Strechay
>> Yeah, yeah. Walking in to get there in the snow, that was an experience. I'm glad for the hot coffee that was across the street from my Airbnb that time. This has been great. I think, again, when you start to look at all of this coming together, all of the AI, there's so many moving pieces, as you were saying, and it moves so fast. I think, again, with this week, our friends at NVIDIA having their GTC and everything and ramping into there, it's going to be a month of AI. The next month is just going to be a lot of fun.
Stu Miniman
>> Rob, haven't we been in like three years of AI?
Rob Strechay
>> I know, but you get to a point where you're sitting there and you're like, how does it make it... To your points, both your points, I think why I love KubeCon in particular, and CloudNativeCon, is the fact that a lot of end users are coming there, not only with their problems and what they're trying to solve for, but with the solutions that they've found that work for specific use cases, so that you can get ROI out of your LLM, how you get beyond just the chatbot, how you get to more agents that are using traditional AI and analytics and other things with it. So I'm excited for it, I get jazzed up about it pretty easily. Thanks for coming on board, I think this has been great. Any last words? No?
Stu Miniman
>> Yeah. It's a big ecosystem out there. If you're going in person, we even have a party that you can sign up with our friends at AWS, because it is hybrid, it's really interesting. It's been pretty exciting for those of us especially that have a hardware background. There were a few years in the cloud era that it was like, does hardware matter? More than ever does it come on. And at Red Hat, hybrid has been our drumbeat for more than a decade, and AI inferencing happens a lot of places, where do I do my training? So hybrid is definitely the reality, and AI more than ever is putting that in the forefront. So we're super excited out there, working with our customers and our partners to help make all these a reality.
Rob Strechay
>> And I have a funny feeling, we haven't said the word KubeVirt yet today, but I'm sure Red Hat virtualization-
Stu Miniman
>> virtualization.
Rob Strechay
>> Yeah, I know. But it plays into the hybrid thing as well, and edge, and how people are transitioning a lot of their workloads.
Stu Miniman
>> Yeah, absolutely. Rob, the conversations I've had with customers for the last two years have been, if I have a problem on my virtualization side of the house and if that's going to cause budgetary issues, I don't have the bandwidth of my people or the dollars to be able to spend on this AI stuff, and if you don't do that, your competition's going to leave you in the dust.
Rob Strechay
>> Absolutely. I think that's a great way to leave it. So Rob, thanks for coming on board. Stu, as always, thanks for coming on board. And thank you for watching. I look forward to seeing you all at KubeCon + CloudNativeCon North America in Atlanta for 2025 on theCUBE, the leader in tech analysis and news.