In this interview from KubeCon + CloudNativeCon EU 2026 in Amsterdam, Brian Stevens, senior vice president and AI chief technology officer of Red Hat, joins Robert Shaw, director of engineering at Red Hat, to talk with theCUBE Research's Rob Strechay and Rebecca Knight about the contribution of llm-d to the CNCF and what it means for bringing production-grade AI inference into the Kubernetes ecosystem. Stevens explains why inference — not training — is becoming the critical challenge as enterprises move AI into production, and why CIOs need infrastructure that speaks Kubernetes. Shaw, a maintainer of llm-d and longtime vLLM contributor, details how the project optimizes entire clusters of model servers to handle the explosive token demands of modern agentic workloads. Together they describe an SLO-driven architecture that disaggregates prefill and decode phases, giving IT teams independent control over input processing and token generation.
Key themes include the cross-foundation collaboration that made llm-d possible, with core changes flowing into vLLM under PyTorch, KServe adapting its custom resource definitions and the Kubernetes gateway becoming AI-aware. Shaw outlines how enterprises are splitting GPU clusters into two deployment patterns: dedicated monolithic stacks for high-priority workloads and shared multi-tenant model-as-a-service environments where developers across the organization experiment and build. He highlights the roadmap ahead, including request prioritization for interleaving critical and non-critical applications, support for next-generation rack-scale accelerator architectures and the security challenges emerging from agentic patterns. Stevens reflects on how rapidly the landscape has shifted — from every enterprise building bespoke DIY inference stacks a year ago to a standardized, community-driven reference architecture today. From the accelerating quality of open source models to the growing compute demands of agentic AI, both leaders provide a practical roadmap for how Kubernetes-native inference will scale to meet enterprise workloads in the years ahead.
KubeCon + CloudNativeCon 2026 Preview with Mike Barrett
Mike Barrett
VP & GM, Hybrid Platforms, Red Hat
Rob Strechay
Dir./Principal Analyst & Host, theCUBE Research
HOST