KubeCon + CloudNativeCon EU 2026: Brian Stevens and Robert Shaw on llm-d and Kubernetes-Native AI Inference
In this interview from KubeCon + CloudNativeCon EU 2026 in Amsterdam, Brian Stevens, senior vice president and AI chief technology officer of Red Hat, joins Robert Shaw, director of engineering at Red Hat, to talk with theCUBE Research's Rob Strechay and Rebecca Knight about the contribution of llm-d to the CNCF and what it means for bringing production-grade AI inference into the Kubernetes ecosystem.

Stevens explains why inference, not training, is becoming the critical challenge as enterprises move AI into production, and why CIOs need infrastructure that speaks Kubernetes. Shaw, a maintainer of llm-d and longtime vLLM contributor, details how the project optimizes entire clusters of model servers to handle the explosive token demands of modern agentic workloads. Together they describe an SLO-driven architecture that disaggregates the prefill and decode phases of inference, giving IT teams independent control over input processing and token generation.

Key themes include the cross-foundation collaboration that made llm-d possible, with core changes flowing into vLLM under the PyTorch Foundation, KServe adapting its custom resource definitions and the Kubernetes gateway becoming AI-aware. Shaw outlines how enterprises are splitting GPU clusters into two deployment patterns: dedicated monolithic stacks for high-priority workloads and shared multi-tenant model-as-a-service environments where developers across the organization experiment and build. He also highlights the roadmap ahead, including request prioritization for interleaving critical and non-critical applications, support for next-generation rack-scale accelerator architectures and the security challenges emerging from agentic patterns.

Stevens reflects on how rapidly the landscape has shifted: a year ago every enterprise was building a bespoke DIY inference stack, while today a standardized, community-driven reference architecture exists. From the accelerating quality of open source models to the growing compute demands of agentic AI, both leaders provide a practical roadmap for how Kubernetes-native inference will scale to meet enterprise workloads in the years ahead.
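To make the prefill/decode disaggregation Stevens and Shaw describe a little more concrete, the minimal Python sketch below shows the general idea of scaling and routing the two phases against separate service-level objectives. The class, pool names, replica counts and SLO values here are hypothetical illustrations, not llm-d's actual API or configuration.

```python
from dataclasses import dataclass

# Hypothetical illustration only; llm-d's real configuration objects differ.
@dataclass
class PoolConfig:
    replicas: int   # model-server replicas dedicated to this phase
    slo_ms: float   # latency objective this phase is scaled against

# Prefill (input processing) and decode (token generation) are operated as
# separate pools, so each can be sized to its own SLO independently.
prefill_pool = PoolConfig(replicas=4, slo_ms=500.0)   # e.g. time-to-first-token target
decode_pool = PoolConfig(replicas=12, slo_ms=50.0)    # e.g. inter-token latency target

def pool_for(phase: str) -> PoolConfig:
    """Route each phase of a request to the pool tuned for its objective."""
    return prefill_pool if phase == "prefill" else decode_pool

print(pool_for("prefill"))  # PoolConfig(replicas=4, slo_ms=500.0)
print(pool_for("decode"))   # PoolConfig(replicas=12, slo_ms=50.0)
```

The point of the sketch is simply that input processing and token generation stop competing for the same capacity: an operator can add decode replicas to meet per-token latency goals without over-provisioning prefill, which is the independent control the interview refers to.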