Akshay Ram, Kelsey Hightower & Eddie Villalba
In this KubeCon + CloudNativeCon North America segment from Atlanta, theCUBE’s Savannah Peterson sits down with Google’s Akshay Ram, Kelsey Hightower and Eddie Villalba to explore how GKE is evolving to handle modern AI inference workloads. They break down GKE’s inference-focused innovations, from the Inference Gateway and GKE Inference Quickstart to new Kubernetes inference API primitives and CRDs that better handle unpredictable LLM request patterns, day-zero performance tuning and accelerator-aware load balancing. The discussion also touches on Google’s work with open-source communities such as Ray and vLLM, and the push toward more hardware-agnostic model serving across GPUs and TPUs.

The conversation goes deeper into Kubernetes as an extensible workload API and “infrastructure framework,” where CRDs turn real-world practices like inference into reusable APIs instead of one-off engineering efforts. Ram, Hightower and Villalba share practical advice for teams just getting started with Kubernetes and AI, emphasizing core principles such as understanding resource types and contracts, leaning on community support and GKE quickstarts, and recognizing the late-mover advantage of inheriting a decade of hard-won patterns.

Looking ahead to 2026, they imagine a world where inference is “just another microservice,” customers remix Google’s building blocks in unexpected ways, and advanced optimizations from research labs and large AI builders flow quickly down to startups, enterprises and regulated industries at their own scale.