Google Cloud: Passport to Containers | Just Another Container: Demystifying Gen AI Inference on GKE | Google Cloud Passport to Containers


Savannah Peterson

Principal Analyst & Host

SiliconANGLE Media, Inc.

Poonam Lamba

Senior Product Manager, GKE AI Inference & Stateful Workloads

Google Cloud

Eddie Villalba

Outbound Product Manager

Google



What does it really take to run generative AI at scale? In this Google Cloud Partner AI Series episode, theCUBE Research’s Savannah Peterson sits down with Poonam Lamba, senior product manager for GKE AI inference and stateful workloads at Google Cloud, and Eddie Villalba, outbound product manager at Google Cloud, to unpack how Kubernetes, and specifically GKE, is evolving to support enterprise AI inference with real-world impact.

Lamba shares how Google is meeting developers where they are, with tools such as the GKE Inference Gateway and custom...


Savannah Peterson

>> Hello, Google fans and welcome back to our Passport to Containers series coming to you today from Atlanta, Georgia, the sixth location in this series, and we're going to be talking about one of my favorite topics. We're going to be talking about demystifying gen AI inference on the Google Kubernetes Engine. Without further ado, two of the most fantastic people on the planet to talk about this, Poonam, Eddie, thank you so much for coming to hang out.

Eddie Villalba

>> Thank you. We're super excited.

Poonam Lamba

>> Thank you for having us.

Eddie Villalba

>> Yeah.

Savannah Peterson

>> It's been a journey. I'm really glad you could make it, especially with flight challenges and everything else today. It's so great to show off some of the other offices, some of the rest of the team around the country and not-

Eddie Villalba

>> It's a beautiful view. It's a great set. Yeah, yeah.

Savannah Peterson

>> This was one of my favorite sets. I think this might be one of our favorite sets ever, at least for me. I don't know about Ken. Let's get the ball rolling by talking a little bit about your background, how you ended up in Kubernetes. But before we even get there, I want to know, Bobby and I talk about this a lot. What made you fall in love with tech? Poonam, I'll start with you.

Poonam Lamba

>> Well, that's a great question. I liked math and science, and I did not think that I'm going to end up becoming a computer engineer until high school. And that's when I-

Savannah Peterson

>> That's still pretty early for the record, you had a lot figured out. I was going to say, I don't think I had any idea what -

Eddie Villalba

>> Way ahead of me.

Savannah Peterson

>> Yeah, yeah. Yeah.

Poonam Lamba

>> Before that, it was astronaut. I wanted to join the Army, do different things, but then I met somebody who was going through computer engineering, talked to this person. They inspired me. I went ahead and applied to the school that I really, really liked. I got in and that's how I ended up in tech.

Savannah Peterson

>> That's awesome. What school was that?

Poonam Lamba

>> It's Army Institute of Technology. That's the only school I wanted to go to if I wanted to do engineering, so.

Savannah Peterson

>> I love that.

Eddie Villalba

>> That is cool.

Savannah Peterson

>> That is cool. Eddie, what about you? When did you fall in love with tech?

Eddie Villalba

>> A very circuitous route. So, no, much later. So, I was full-on, I wanted to be in the military. I went to the Air Force after high school.

Savannah Peterson

>> Thank you for your service.

Eddie Villalba

>> Thank you.

Savannah Peterson

>> Yeah.

Eddie Villalba

>> Did nothing with computers as a career, but my dad had always had... He's a techie guy at heart, so we had Commodore 64s and 128s, and I was always on a computer at some point in time through my youth. So, I was always around it. And then I decided that was something I wanted to do when I was still in the military. So, I got my degree in computer science while I was still in. And then when I got out, I was like, "Well, I can be an air traffic controller or I can do this." And at the time I was just like, "You know what? I really want to do something in technology." And that's how I started the process. But yeah, it's like my second career, I guess, for lack of a better term.

Savannah Peterson

>> Wow, okay. So, you might've been controlling the plane that got us all here-

Eddie Villalba

>> Hopefully, yeah. Yeah....

Savannah Peterson

>> versus-

Eddie Villalba

>> Yeah, no, yeah. Gosh, yeah, that's when it was a-

Savannah Peterson

>> Isn't it wild the different lives we could have lived? I mean, just pondering that.

Eddie Villalba

>> Absolutely. Yeah, yeah.

Savannah Peterson

>> Yeah. Well, you'd be just as in demand as a technologist or an air traffic controller in this day and age, with everything that's going on. So, how does this lead y'all to Kubernetes and AI within the technology world? Poonam, I'll start with you again.

Poonam Lamba

>> Yeah, so I started as an engineer building applications and eventually platforms for a large bank. And the technology goes through a whole bunch of evolution. So, you may be writing a piece of code and you may be running it on your own CPU, and you then run it in a virtual machine. And then there was this whole phase of platform-as-a-service, and you write code through all of those evolutions of software. And eventually seven or eight years back, I was working on evaluating Istio, the service mesh, for the organization. And that was the first time I actually got my hands on Kubernetes and really, really loved it. And since then, I've worked on different areas of Kubernetes, including being part of a team that was building a private cloud for the bank using Kubernetes. And then I joined Google and ended up in the GKE team, so more Kubernetes, I guess. Yeah, that's how I ended up working on Kubernetes. But I do want to say that at a fundamental level, if you really internalize the engineering principles and you're comfortable building code and then you understand how to run that code at scale for a critical business, I think the evolution that we go through is just easier. It's just understanding how things will change as you move from one platform to the other. But the underlying principles of software engineering, they more or less stay the same. And I think we are going to talk more about that.

Savannah Peterson

>> Yes, we absolutely. That makes a lot of sense. So, you were outside of the Google Kubernetes Engine and have been absorbed into the fold. I'm not surprised. I love that. I can see why. What about you, Eddie? How'd you end up in Kubernetes land?

Eddie Villalba

>> Yeah, that's a good question. So, my background has always been in building data centers, very complicated organization environments, and looking at it from a distributed nature, or building very large-scale applications. And I was working for another cloud vendor at the time when this thing called Docker came about. And I was like, "Okay, I need to investigate this a little bit further." And then the quick evolution from there to Kubernetes, and then I kind of went in literally head first. I was just like, "Okay, I want to start contributing. I want to be on the release teams. I want to know everything there is about it." And then at the same time, I got very lucky. A mentor of mine who was working with me at the cloud provider I was at, he was at Google before, and he was actually one of the original co-creators of Kubernetes, coming out of Borg. He's in the papers and everything about it.

Savannah Peterson

>> Yeah.

Eddie Villalba

>> So, he and I, he kind of brought me under his wing. We co-wrote Kubernetes Best Practices for O'Reilly Press together and a couple of other peers of mine. So, I gained a lot of experience very quickly from the people that kind of established Kubernetes as the de facto container platform. And I was very blessed at that. Like how quickly can you ramp up to something so complicated unless you learn it from some of the people that made it, right?

Savannah Peterson

>> What a wonderful experience for you, honestly.

Eddie Villalba

>> Yeah. For sure. And then coming to Google was like, "I just got to go back to where it was born," and then meet the rest of the team that did an amazing job. And a lot of them, thankfully, are still here. And it has really been a pleasure for sure.

Savannah Peterson

>> Yeah, it has been. And you just talked about an interesting story with how this has become so ubiquitous.

Eddie Villalba

>> Yeah.

Savannah Peterson

>> And I feel like... I got into the Kubernetes world five years ago, a little bit more than that, and there was still this toggle between Docker and other tools. Are we building on Kubernetes? What are we doing with containerization and how are we deploying this, and what's it going to do at scale? And I feel like we've really tipped over in the last 18 months or so where it feels like if you're building and you're thinking particularly about AI at scale, it's Kubernetes and it's cool. I mean, you made that bet a while ago and it's come together for you.

Eddie Villalba

>> Yeah. I think the bet was one of those things that a lot of organizations had to do a leap of faith.

Savannah Peterson

>> Mm-hmm. Totally.

Eddie Villalba

>> But the principles were sound, right? I think that the-

Savannah Peterson

>> And a leap of complexity in a lot of circumstances, frankly.

Eddie Villalba

>> Yeah, yeah. But it solved a lot of the problems that organizations were kind of faced with at the time. And now, if you think about AI, I hate to say AI is just another workload. It is a workload, but specialized, right? And if you think about it... And then there's obviously a couple of different sides of AI, but we want to talk about serving, inferencing, where the end users actually use the product.

Savannah Peterson

>> Yeah.

Eddie Villalba

>> I think that's where we start seeing how Kubernetes really fits that really well. Yeah.

Savannah Peterson

>> Let's unpack that a little bit because we're headed there anyways. Inference, a term I don't think we're hearing enough quite frankly, but the thing that really makes AI real and for the rest of us in that experience, but it really is just another containerized serving workload. Talk to me about that, Poonam.

Poonam Lamba

>> So, let's ground ourselves in the real-life example. I am going to school, so I'm going to pick that example.

Savannah Peterson

>> Perfect.

Poonam Lamba

>> Let's say there is a student named Alex. They go to school, they read books, practice quizzes, papers, and that means they're consuming a lot of data and a lot of information. And you can compare it, roughly, to how AI models are trained. And then one of these days, Alex has to go and write this exam and they get a question which is not something that they had seen before. So, now Alex is not going to run and hit more books or go back to school. They're actually going to use the knowledge that they have gained through training themselves, and they will use that and they will answer that question. And you can think of it as inference. And now, let's map it to a system that you would build to do the same job.

Savannah Peterson

>> Right.

Poonam Lamba

>> Let's say you have trained a model. Now, you will take that model, the configuration that you need to run that model, the libraries, the runtime environment like TensorFlow or PyTorch or JAX. You will package all of these things into a container, and now this becomes a portable unit that you will take from your testing to production. And now, if you are an application developer who has worked on Python or Java or any other language, you are already thinking that that's exactly what I do when I'm building any web application. So, I want to say it is not different. There are of course some differences. It could be large and you could have to do model compression. It uses accelerators and all of those things. But at the foundational level, the core concepts still stay the same. It's an application which has been trained, and now you're getting some requests with some new data that you may or may not have been trained on, and then you're getting a response back. So, in a nutshell, that is inference, and it is very similar to the applications that we've been writing in the regular web app or microservices world.
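To make the "it's just another application" point concrete, here is a minimal sketch of an inference request handler. It is not from the speakers: `ToyModel`, `handle_request`, and the payload shape are hypothetical names chosen for illustration, and the "model" is a two-parameter stand-in for real trained weights.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for trained weights; a real model would be loaded from a file."""
    weight: float
    bias: float

    def predict(self, x: float) -> float:
        # Inference: apply what was learned in training to an input not seen before.
        return self.weight * x + self.bias


def handle_request(model: ToyModel, payload: dict) -> dict:
    """Same shape as any web handler: validate the request, compute, respond."""
    if "input" not in payload:
        return {"status": 400, "error": "missing 'input'"}
    return {"status": 200, "output": model.predict(float(payload["input"]))}


if __name__ == "__main__":
    model = ToyModel(weight=2.0, bias=1.0)
    print(handle_request(model, {"input": 3}))
```

In a container image, the model file, this handler, and the runtime libraries would be packaged together, which is the portable unit described above.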

Savannah Peterson

>> Yeah, I love that you used it... I love the analogy of a human learning because I think we're very much in a space where we're separating humans and machines and it's getting a little blurry and a little weird, but it is just like us learning and having the ability to critically reason in the moment and do whatever's supposed to happen next, whether that's answer that question or do anything else. So, you talked a bit about these unique workloads. You talked also about how GKE makes these a bit easier.

Eddie Villalba

>> Sure.

Savannah Peterson

>> How does it do that?

Eddie Villalba

>> Well, I mean, I think based on, speaking of what Poonam was talking about, where it's a workload, right?

Savannah Peterson

>> Yeah.

Eddie Villalba

>> With special needs. So, if you think about-

Savannah Peterson

>> Why are those needs so special?

Eddie Villalba

>> Yeah. Because at this point now we're talking about special chips... Accelerators, the GPUs, TPUs that are so popular nowadays and in the news. It's important to have those for the LLM to reason because that's the way they process and that's the architecture they're built on. So, I'm going to follow my other mentor, Bobby, in his food analogies, right?

Savannah Peterson

>> Yes.

Eddie Villalba

>> Yeah. So, get back to food.

Savannah Peterson

>> Our favorite part of every show.

Eddie Villalba

>> I never worked in the food service industry, but I'm obsessed with The Bear, the show, The Bear, and fine dining restaurants and how that works. But if you think about GKE, when you go to a restaurant, even fine dining restaurants, they take in orders. That kitchen has to be able to create a basic bisque all the way to the most complicated Beef Wellington. So, if you think about what GKE is, this is a very complicated, very organized kitchen that has all the equipment you need. But when I need to create that Beef Wellington, I can. When I need to create just a bunch of salad, I can. So, when I need to just serve web services, easy, GKE was already built for that. Now, with all those primitives in the APIs, we built upon that. The accelerator is just another resource and it's another API. So, Kubernetes was always good at assigning resources to your compute, memory and CPU. Now this is just another resource that we optimize for that workload. So, it's the same kind of concept, right?
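As a rough illustration of "the accelerator is just another resource," the sketch below builds a Pod manifest as a plain Python dict. The `nvidia.com/gpu` key is the extended resource name Kubernetes uses for NVIDIA GPUs; the function name `serving_pod`, the pod name, the image, and the CPU and memory figures are made-up placeholders, not values from the conversation.

```python
import json


def serving_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest where the GPU sits beside CPU and memory as one more resource."""
    container = {
        "name": name,
        "image": image,  # hypothetical image reference
        "resources": {
            "requests": {"cpu": "4", "memory": "16Gi"},  # placeholder sizes
            # GPUs are requested through limits, using the extended resource name.
            "limits": {"nvidia.com/gpu": gpus},
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"containers": [container]},
    }


if __name__ == "__main__":
    pod = serving_pod("llm-server", "example.com/my-model-server:latest", gpus=1)
    print(json.dumps(pod, indent=2))
```

The scheduler treats the GPU count the same way it treats CPU and memory: it places the Pod only on a node that can satisfy every requested resource.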

Savannah Peterson

>> Yeah. We're already poised to do it, essentially.

Eddie Villalba

>> That's right.

Savannah Peterson

>> Yeah. Oh, man, that food analogy is so good. That's definitely making our next food highlight reel. But I think that's absolutely it, right? I mean, developers, creators, they're builders, they're chefs, they want to be able to do stuff. And I love this term that came up in one of our other segments, this optional complexity, and just the notion of that is kind of nice. Sometimes you do want to be able to adjust or see or do everything, and sometimes you don't. Sometimes you just want it to work.

Eddie Villalba

>> Right.

Savannah Peterson

>> So, let's talk a little bit about custom compute class and DWS Flex Start. Break those down.

Eddie Villalba

>> I'll go. I'll continue the thread, right?

Savannah Peterson

>> Yes, you're already there, so take us home.

Eddie Villalba

>> Yeah, I think on the same, following on that thread of a very high-end restaurant and kitchen stuff, you're going to have some very high-end equipment in that kitchen, and you're going to be able to... But at some time, like, "Hey, if my salamander broke..." I think there's something called a salamander, actually.

Savannah Peterson

>> There is.

Eddie Villalba

>> Yeah, or a sous vide.

Savannah Peterson

>> It heats things.

Eddie Villalba

>> It heats things, right? That's how you make great steaks, right?

Savannah Peterson

>> Yeah.

Eddie Villalba

>> If that broke, what does the chef say? "Sorry, you can't have your steak today."

Savannah Peterson

>> Yeah, woof.

Eddie Villalba

>> No, it's going to happen, right? So, if you think about what a custom compute class does, the custom compute class for us is a way for you as a customer to say, "I need capacity. I need something. I need to get cooking, and I need it whenever at the time I need it." And this guarantees that something will be available to you. It could be either in a smaller CPU or a smaller GPU size or a different GPU size or a different way to acquire that GPU. But either way, what we're trying to make sure is that the customer, especially for inferencing workloads, I call them serving workloads because you're serving up something, you're getting-

Savannah Peterson

>> Appropriate for the restaurant as well.

Eddie Villalba

>> That's right.

Savannah Peterson

>> You're really killing it on that one. Yeah.

Eddie Villalba

>> Yeah. So, when I'm serving up something, I'm hitting an end user and I need to make their experience a happy one. So, I need to make sure that the resources that workload needs are available at all times. So, custom compute classes are a way for our customers to get the capacity they need when they need it, in a priority order that they decide, but also sometimes in the most equitable fashion. It's like, "Hey, I want to go to what I've reserved. Or maybe I can fall into spot, which is cheaper, but it's not always there if I don't need it, but it's okay for a bursty workload or something of that nature." So, that's what our custom compute class does. And mixed with our... We have a new flavor of consumption called Dynamic Workload Scheduler, DWS. So, CCC will allow me to say, "I want on-demand, then DWS, then spot," and let that process fail over. So, the chef can say, "Let's put this on a sous vide. Let's put this on the salamander and let's move it around and let's get what we need from where we get it." And that's how it kind of works.
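The priority-ordered fallback described here ("on-demand, then DWS, then spot") can be sketched as a simple first-match search. This is only an illustration of the idea, not how GKE implements custom compute classes; `first_available`, the option labels, and the availability map are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_available(
    priorities: Sequence[str],
    has_capacity: Callable[[str], bool],
) -> Optional[str]:
    """Walk the customer's priority list and return the first option with capacity."""
    for option in priorities:
        if has_capacity(option):
            return option
    return None  # nothing available right now; the workload stays pending


if __name__ == "__main__":
    # Hypothetical priority order and a made-up capacity snapshot.
    order = ["on-demand", "dws-flex-start", "spot"]
    capacity = {"on-demand": False, "dws-flex-start": True, "spot": True}
    print(first_available(order, lambda option: capacity[option]))
```

The design point is that the customer states the order once and the platform does the falling over, rather than the customer retrying each capacity type by hand.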

Savannah Peterson

>> Well, and this is unique because first of all, running an accelerator 24/7 is super expensive, can be very hard to have the right power needs at the right time. And usage is bursty. It's not like everything's getting executed consistently. So, how does the GKE Inference Gateway help with that, Poonam?

Poonam Lamba

>> So, traditionally when you run a web application, you have a load balancer, but these load balancers were designed for homogeneous load and stateless sort of routing of the requests that are coming in, which is very different from when you're serving AI requests, like inference requests coming in, because that could be bursty. You may have different sorts of requirements, where you have a chatbot which needs a sub-millisecond or sub-second response, versus you have a batch inference, where maybe you can wait for a response to come back.

Savannah Peterson

>> Right.

Poonam Lamba

>> So, Inference Gateway, it is open source, but we have a managed version of Inference Gateway through GKE as well. It's model-aware and it is accelerator-aware and let's break it down.

Savannah Peterson

>> Yeah, I was just going to say, what does LLM-aware mean right now?

Eddie Villalba

>> Yeah.

Savannah Peterson

>> So, yeah, let's go.

Poonam Lamba

>> Yeah. So, what it does is when you are sending requests to Inference Gateway, you can specify the model name. And if you have different models or you have multiple versions of the same model, you can specify all of that in the request body. You can also specify if the incoming request is critical, standard or something that you can drop. So, depending on all that data, Inference Gateway can make a decision to route your request, but there's more. It is also collecting real-time metrics from the KV cache utilization and also the queuing that is happening at model server level. So, it uses all of those real-time metrics, and then it is integrated with HPA, which is HorizontalPodAutoscaler. So, it can then scale your workloads up in case of a bursty request, or scale it down when there are no requests coming in. So, in a way, most of our customers that we meet with, they're looking for faster, better, cheaper ways of serving inference and Inference Gateway actually helps you do that.
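A toy version of the routing decision just described might look like the following. It is a sketch under assumptions, not the Inference Gateway's actual algorithm: the `Replica` fields, the `route` function, the `"sheddable"` label, and the 0.9 saturation threshold are all invented for illustration. It only shows how model name, request criticality, KV cache utilization, and queue depth could combine into one decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    model: str
    kv_cache_utilization: float  # 0.0 (empty) to 1.0 (full), reported by the model server
    queue_depth: int             # requests waiting at the model server


def route(
    replicas: List[Replica],
    model: str,
    criticality: str,
    saturation: float = 0.9,  # assumed threshold, not a documented default
) -> Optional[str]:
    """Pick the least-loaded replica serving `model`, or None if the request is dropped."""
    candidates = [r for r in replicas if r.model == model]
    if not candidates:
        return None
    # Prefer the replica with the most free KV cache, then the shortest queue.
    best = min(candidates, key=lambda r: (r.kv_cache_utilization, r.queue_depth))
    # If even the best replica is saturated, droppable traffic is shed.
    if criticality == "sheddable" and best.kv_cache_utilization >= saturation:
        return None
    return best.name


if __name__ == "__main__":
    fleet = [
        Replica("pod-a", "chat-model", 0.50, 3),
        Replica("pod-b", "chat-model", 0.95, 1),
    ]
    print(route(fleet, "chat-model", "critical"))
```

The same real-time signals can also feed an autoscaler, so that sustained saturation adds replicas instead of only shedding load.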

Savannah Peterson

>> I mean, it makes a lot of sense and it's really critical. I can imagine the feedback from your community is positive, yeah?

Eddie Villalba

>> Yeah, it's been great. I think the concept is that this whole idea of the transformer model and how the transformer, which is the generation of what most of these LLMs are based on has brought the complexity, and then making it simpler for our customers and our end users to understand how do we optimize for the way that this resource uses different things. And really saying, "Let's perform at the best level." And that's when we said, "Okay, now I can save money and I can save resources and scale when I need it." And that's great.

Savannah Peterson

>> And it sounds like accelerate time to value.

Eddie Villalba

>> Absolutely.

Savannah Peterson

>> If you're able to do everything more efficiently, you can probably do more experiments. You can probably test different things, build faster-

Eddie Villalba

>> And that's our goal. Our goal is to get customers to be faster to market because this is a faster race, and so how do we get them faster to the market? And then once they're in there, how do we make sure that they can accelerate very quickly at the most optimal and performant way?

Savannah Peterson

>> Given the scale of companies that y'all work with, I would imagine you have some playbooks or some quick start guides that can help people and dashboards out of the box.

Eddie Villalba

>> Funny you mentioned that. Yeah, actually-

Savannah Peterson

>> Tell me about those.

Eddie Villalba

>> Yeah. So, we launched at Next, in preview, something called our Inference Quickstart. And basically what we're doing is we're leveraging the power of our DeepMind and our partnership with DeepMind and making sure that they're constantly doing benchmark tests on some of these larger open source models.

Savannah Peterson

>> Yeah.

Eddie Villalba

>> The Gemmas, the Qwens, the Llamas, the DeepSeeks, and we're looking at them across our hardware fleet and getting these metrics. And then we're looking at very specific things that matter to customers that are serving generative AI workloads. So, an example would be, how fast does that bot get an answer to me after I hit enter, right? We call that time to first token. That's a very important metric. And then there's, okay, if I just get one word back and then I have to wait an hour before the rest of the stuff comes out, that is a bad experience. So, then there's time per output token, TPOT. So, we have this metric called normalized TPOT, which basically says, this is the overall kind of experience the user would have on getting a response back from the serving model. And so, what we've done is we've kind of cemented ourselves on, let's look at if a customer wants to dial in like, "Hey, I'm an AI engineer, we're about to serve an application." I tell my platform engineer, "Hey, we're looking for at least 42 millisecond total round trip. That's what we're aiming for." So, we basically give them a little slider or CLI and say, "Hey, I need this much." And we'll tell you, hey, what accelerator you should be using-

Savannah Peterson

>> Nice...

Eddie Villalba

>> and what generation, what model and then from there... So, you give us the model, you give us the performance metric you're looking for, and then from there, we'll give you what accelerator you should do. And then we went one step further. We said, "Hey, guess what? We'll give you a Kubernetes YAML manifest that is already pre-configured with all the right parameters for your deployment, also for your HPA, for your metrics and monitoring."
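The latency metrics mentioned in this exchange can be computed from per-token timestamps. The sketch below uses common definitions, which are assumptions here since the speakers do not spell them out: time to first token (TTFT) is the delay until the first token arrives, time per output token (TPOT) is the average gap between tokens after the first, and normalized TPOT is taken as total request latency divided by the number of output tokens. The function name and return keys are hypothetical.

```python
from typing import Dict, List


def latency_metrics(sent_at: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, and normalized TPOT (all in seconds) for one request.

    sent_at:      time the request was sent
    token_times:  arrival time of each output token, in order
    """
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - sent_at
    total = token_times[-1] - sent_at
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    # Assumed definition: whole-request latency spread over every output token.
    ntpot = total / n
    return {"ttft": ttft, "tpot": tpot, "ntpot": ntpot, "total": total}


if __name__ == "__main__":
    # Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
    print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

A slow first token shows up in TTFT and normalized TPOT but not in TPOT, which is why a single number that folds both in is useful as a target to dial against.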

Savannah Peterson

>> That's a big deal for getting up and off the ground.

Eddie Villalba

>> Yeah. So, think of it like, again, I don't want to go back to the food, but it's-

Savannah Peterson

>> Always go back to the food.

Eddie Villalba

>> Let's go back to the food.

Savannah Peterson

>> Always go back to the food.

Eddie Villalba

>> Imagine having like a... My wife and I at one point, were just so busy, we started doing those take-home meal kits. Imagine having the highest end restaurant Nobu sending me these meal kits, right?

Savannah Peterson

>> I was going to say no, but -

Eddie Villalba

>> Yeah. Yeah. So, a meal kit at home, right?

Savannah Peterson

>> Yeah.

Eddie Villalba

>> But it's a quick start. I don't have to worry about figuring out what's the recipe, what do I need, what do I have? It's all there for me, and then I can tweak it if I need to afterwards for my flavor, for my taste. That's basically what the Inference Quickstart will do for our platform teams and our AI engineering team for our customers.

Savannah Peterson

>> So, I really love this. The audience is probably not familiar with this, but I know Bobby is. So, my mother has been following along with the food analogies and our food sizzle reel and this Food Lover's Guide to AI that I just wrote recently, and it has been... Hi, Mom, by the way. It has been such a beautiful bridge for us to be able to talk about all of this stuff because first of all, I mean all good people like food, let's be honest, and at least if you're not probably in my friend group. But it allows us to digest these very complex concepts in a way that is palatable and also easy to wrap our minds around in the same way your analogy about the student and being a student did, and I think that's actually a big part of our job as leaders and shepherds of this technological revolution, is to make sure that everyone can understand. So, never be bashful on the food. I absolutely love it. What about networking and storage? Why is this important for inference? Poonam, I'll ask you.

Poonam Lamba

>> Inference workloads can be bigger and it can be really chatty. You can have a lot of requests coming in, and that is why networking is important, and there is a lot of innovation that is happening in that space, including Inference Gateway that squarely sits into the networking space as well. Storage is another area, which it wasn't cool until the...

Savannah Peterson

>> Okay, a lot of this stuff that we do wasn't cool until.

Savannah Peterson

>> I'm just going to be really honest. Nerds are having a moment.

Eddie Villalba

>> Yeah. Yeah, it is that moment right now.

Savannah Peterson

>> Right now, it has never been cooler to be us than it is, and I am here for it. I'm living for this moment.

Poonam Lamba

>> So, for this week, I am actually a PM for one of the storage products. It's called Managed Lustre. We just launched it on the 7th of July.

Savannah Peterson

>> Oh, my goodness. Congrats.

Poonam Lamba

>> Thank you. And what it does is it's a parallel store which gives you the speed and scale that you need to handle petabytes' worth of data for your training workloads or HPC workloads or things like that, or large-scale inference that you want to serve. So, I'm just giving you one example, but there is a lot of innovation that is happening in the storage space for KV cache, for checkpoints, for how you want to retrieve and store all this data. And some of it is also related to how a lot of our customers are designing these inference applications. Like I said before, a chatbot needs a sub-second response, but a batch does not. So, you can take an example where you're running a batch inference at scale, and then you're storing that data and when a request is coming in, you're actually serving it from that store. So, there is a lot that is happening in this space, but ultimately the customer needs are speed and scale. And for that, there is a lot of innovation in this space.
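The "run batch inference, store the results, then serve from the store" pattern can be sketched in a few lines. This is an illustration only: `run_batch`, `serve`, and the in-memory dict standing in for a real storage system are hypothetical, and the "model" is any callable passed in by the caller.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_batch(prompts: Iterable[str], infer: Callable[[str], str]) -> Dict[str, str]:
    """Offline step: run inference over known inputs and keep the results."""
    return {prompt: infer(prompt) for prompt in prompts}


def serve(prompt: str, store: Dict[str, str], infer: Callable[[str], str]) -> Tuple[str, str]:
    """Online step: answer from the store when possible, else fall back to live inference."""
    if prompt in store:
        return store[prompt], "store"
    return infer(prompt), "online"


if __name__ == "__main__":
    toy_model = str.upper  # stand-in for a real model call
    store = run_batch(["hello", "world"], toy_model)
    print(serve("hello", store, toy_model))
    print(serve("new question", store, toy_model))
```

The trade-off is the one described above: the batch step can take its time, while the online path stays fast because most answers are a lookup.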

Savannah Peterson

>> Yeah, it's been fun. I cover more of the consumer edge where humans meet technology is my bit, and I've been working a lot more closely with our analyst... Hi, Bob, who does the networking side of our practice, because these worlds are kind of coming together through AI and edge and inference all at the same time, where we find ourselves in the same conversation in a really great way where I'm like, "Whoa, I didn't realize how critical the networking piece of this was just because I wasn't in those weeds." But it's been so fun to connect the dots and kind of realize, "Wait a minute, the entire infrastructure of everything has to improve to keep up with this acceleration." And that's where all of our collaboration and honestly why I think GKE is so uniquely poised. I mean, I guess we're sitting here at Google, but I thought this before we were doing this series, but you're so uniquely poised to help be the player in the game who helps people build the coolest stuff because it's built for that. It's designed for that, and it's agile enough. So, for the platform engineers who are watching us right now, if you could tell them anything about GKE or frankly anything that you're working on right now, what would you like them to know? Eddie, I'll start with you.

Eddie Villalba

>> You are safe. GKE is a safe place.

Savannah Peterson

>> Nice thing to say.

Eddie Villalba

>> It's the platform. And more importantly, it's like, don't be afraid to get into the conversation with the AI engineers and the ML engineers in your organization, and know that we can build a platform that supports not just the regular workloads, but also the major AI workloads. And again, most of these customers who have built large platforms on Kubernetes and on GKE, they've already got those established processes and procedures and DevOps practices and all that stuff. Now you can take that and bring it into the serving side, which is basically the same thing that they were doing before, again, but now with fancy hardware and lots of storage and lots of networking. And so, paradigms change a little bit, but the platform is the most important thing at this point in time. Because without a very strong platform, you can train all day because training fails, you start training again. But if my major enterprise app goes down, we've got problems. So, the platform and what they've built, it's important.

Savannah Peterson

>> Yeah, it is important. Poonam, what would you want them all to know?

Poonam Lamba

>> I want to say that there are lots of buzzwords in the whole AI, ML space right now. I want to say that keep an open mind, try the recommended approach that GKE has already written down. Do things and learn, and also remember that you can always map what you were doing to the AI, ML space. Things like designing a platform, running a platform at scale, production readiness, Day-2 operations, SRE, those things, they don't disappear. They just morph a little bit. But the core principles stay the same. So, again, all I ask is try the things that we are building. There are labs and tutorials and quick starts. I think the best way that I learn is just by doing. And when you actually do that, you'll find that it's not very complicated to build applications with LLMs, inference applications or even training workloads.

Savannah Peterson

>> I think that's good advice. And to your point, particularly with Google, but even across the internet, there's a lot of resources.

Eddie Villalba

>> Yes.

Savannah Peterson

>> There's this really nice sense of we got to build better together for this one, I feel like. And plus it's so expensive talking about so much data. It's just a different level of collaboration that I'm seeing across, but the industry as well as academia and the government, it's cool. It's a very exciting time. Like we're saying, it's a good time to be a nerd. It's a really great time to be a nerd. Well, speaking of, what excites you most about the future of where GKE is going and what your customers are going to be able to build as a result? Eddie, I'll start with you.

Eddie Villalba

>> Yeah, I mean, I guess the big buzzword I think for me is agentic AI. And the reason being is my background is dev and distributed systems. I really believed in the microservice world. Prior to AI becoming a big thing, microservices was the big thing, and Kubernetes was kind of built around that microservice concept. And if you think about what agentic AI really means, it's about... When we started this whole AI concept and we started shipping applications, they were all monoliths. They were literally a bunch of Python code in a container that serves, that does everything and it's one big application. And now with the agentic AI, now what we're saying is, "No, we need to separate these, have different concerns, scale them differently when they need to." I'm calling it microservices 2.0. I'm going to sit out there and put that out there because I think it's really how I can have different agents doing different things at different scales and then having-

Savannah Peterson

>> I like that take, yeah.

Eddie Villalba

>> In reality, if you think of the things that we talked about, MCP and the model context protocol and A2A that we just recently gave to -

Savannah Peterson

>> All the acronyms we didn't have six months ago.

Eddie Villalba

>> I know, agent-to-agent protocol, what they're basically doing is making it easier for customers who already took this microservice attitude and built applications in this compartmentalized way and say, "Now I can easily add agents to it and more importantly, make LLMs aware of my APIs and my applications."

Savannah Peterson

>> Yeah.

Eddie Villalba

>> So, it's really making it so it's easier to innovate and make what we already have better and faster.

Savannah Peterson

>> It's going to be cool.

Eddie Villalba

>> I think it's going to be really cool.

Savannah Peterson

>> Yeah. What about you, Poonam? What gets you most stoked?

Poonam Lamba

>> Just want to say never a dull moment.

Savannah Peterson

>> Sitting next to -

Poonam Lamba

>> Well, both, both, both. But I think this space is so exciting right now. There is a lot going on, and I think it will continue to grow, and this whole pie of AI, ML market will continue to grow. And eventually we'll settle down on some sort of architectures or solutions and stuff like that, but I think we are not there yet. There is a lot of innovation that is happening in both inference. For example, distributed inference. We worked with Red Hat and a bunch of other big partners and launched llm-d for Kubernetes. So, that's going to be something that you watch for. And other than distributed inference, I think agentic like Eddie said, is going to be a big thing, and we are all focusing on that. But this whole space will evolve. For example, it'll move from maybe training your... And that's my hypothesis. You do pre-training today, a lot of pre-training, but that moves into you just launch an agent and it learns in its environment and by working with other agents in real-time. So, your training moves from pre-training to inference and training are happening at the same time, like reinforcement learning.

Savannah Peterson

>> Yeah.

Poonam Lamba

>> So, it's a very exciting space and there is a lot that is happening right now, so.

Savannah Peterson

>> All right. So, last question for you both. What do you hope this next technological era, we don't just have to call it AI because I think we'll stop using this term in a while and things will just be operating differently in the same way we did with compute and laptops for example? What do you hope this next technological revolution does for the people that you love and you care about outside of our little nerd fam? Poonam, I'll start with you.

Poonam Lamba

>> I'll give you one example. I was planning a family trip to Lake Tahoe last month.

Savannah Peterson

>> Yes, beautiful place.

Poonam Lamba

>> And usually when I do something like that, I have to do a bunch of research beforehand, Google stuff, and-

Savannah Peterson

>> Lots of tabs open. If you're like me, there's 37 different tabs where I'm going in between different hikes and routes and-

Eddie Villalba

>> Yeah.

Savannah Peterson

>> Yeah.

Poonam Lamba

>> Yeah. And I have kids and I have to take that into consideration as well as other sort of activities that they really want to do. And this time I just went to Gemini and I said, "Hey, I'm planning a trip from this date to this date, Lake Tahoe. I'm going to pick a rental car from San Francisco and then I'm going to drive. Can you please just create an itinerary?" And it just did it, and it took me less than a minute.

Savannah Peterson

>> Amazing.

Poonam Lamba

>> All I did is I took a picture of it and I did exactly, I think 80% to 90%, I did the same thing. So, our lives are changing and I think in a good way. One last example, like my nine-year-old son-

Savannah Peterson

>> As many as you want, these are great. I love this. That's why I ask the question. This is usually where we get the best answers.

Poonam Lamba

>> So, I'm working and my nine-year-old son walks into my office and he's working on this reading comprehension and says, "I need more passages like this to practice." And usually it's something that you will start working on and it'll take you hours. You will figure out what to buy, which book to purchase and all that. I went to Gemini and I asked, "Hey, can you generate passages for nine-year-old and some questions with some answers that I can later compare his answers to." And it was done in minutes and I just printed that, gave it to my son, and off you go, so-

Savannah Peterson

>> It's honestly just a couple of seconds. I was using Gemini, this is nerd story. I was using Gemini in the car, driving from Chicago to Des Moines with my boyfriend who's studying for his level three master somm test. And I mean, I'm not the best person to quiz him. I love wine, but I'm not at that level. And oh, maybe someday you never know. But I was able to do exactly what you said. I was able to feed in questions, sample questions they have on the test, and then have him build them out different quizzes about different sections, different wine regions, different cocktails, and apéritifs and it was so fun, particularly to watch someone who's not as much of a... Well, he's a wine nerd, but he's not a nerd like we are in techlandia. But to watch him get so excited about the tool to see what was possible, and he is like, "Oh, my God, I'm going to show all my other somm friends and I'm going to show the restaurant and I'm going to do all this stuff." And I got to... We don't always get to, but they're so nice when we do, I got to see that light bulb moment when all of a sudden everything changes.

Eddie Villalba

>> Yeah.

Savannah Peterson

>> And even for you as a parent being able to do that so much faster and easier, it's not like you have time to be doing your day job in the car, on the way home, writing out questions for him. These tools can, they don't have to, but they can make us much better and more efficient versions of ourselves, which is very cool. What about you, Eddie? What do you hope it all does for the people you love?

Eddie Villalba

>> My story is very similar actually. I was asked to speak at my daughter's school. She was in kindergarten last school year. She was in kindergarten, and they did a career fair, and I'm like, "I got to talk to kindergarten about AI and Kubernetes. How am I going to do that?"

Savannah Peterson

>> Love this challenge, I'm living for this, want to be a fly on the wall, really.

Eddie Villalba

>> So, I got in there and what I did was I created a Gem on Gemini and I pre-trained it and told it like, "Hey, you're speaking to a group of five-year-olds at elementary school."

Savannah Peterson

>> This is so cute.

Eddie Villalba

>> The name of the elementary school, their mascot, everything. And then I walked in there and then I used the Gemini and I put it on the screen and I put the microphone on and each kid came in with a story. Like, "So, let's build a story together." And then I started the story. I want to make a story about a little tiger because the Dripping Springs Tiger, where I live in Texas, and he's starting kindergarten and he wants to learn what he wants to do for a career. Then each kid came up and we used the microphone. They talked into Gemini, feed into this-

Savannah Peterson

>> He's an NLP baby. I love it.

Eddie Villalba

>> And it created a whole entire story and it created a picture of the little tiger with the backpack on at school and using our RVO model. And I had a little kid in the back. He was wearing an astronaut suit because all the kids were wearing what they want to be when they grow up. And he said, "I want to change my job. I want to be an AI engineer."

Savannah Peterson

>> Oh, God. I just got goosebumps.

Eddie Villalba

>> Yeah.

Savannah Peterson

>> I just felt that straight in my soul. I mean, I love astronauts too -

Eddie Villalba

>> Yeah. I want to make sure people understand this is a way for us to be smarter.

Poonam Lamba

>> High school wasn't too early.

Eddie Villalba

>> Yeah. Yeah. Yeah.

Savannah Peterson

>> Oh, my gosh. Oh, I love that. How good did that make you feel?

Eddie Villalba

>> It is the best feeling-

Savannah Peterson

>> Did you have a little cry in the car?

Eddie Villalba

>> At the end of the day, I was like, "Yeah, I still talk to him." I run into his parents at the coffee shop. We have a very small town. I run into his parents at the coffee shop and like, "He keeps talking about your AI talk." I'm like, "Yeah, okay." So, it was really cool. Yeah, yeah. So, as long as the next generation understands that it's going to be different, there's no lie. It's not going to be different. So, we're starting to see different, but it's what we do to adapt it and where we can take ourselves and then be part of it instead of being in it, right? It's like, "Hey, be part of it. Build it, and don't just let it happen to you."

Savannah Peterson

>> Yeah. I think that's great advice for just about everything.

Eddie Villalba

>> Yeah, that's true.

Savannah Peterson

>> This is great. Poonam, Eddie. Thank you so much.

Eddie Villalba

>> Thank you.

Savannah Peterson

>> One of my favorite conversations of the series, I'm not supposed to pick a favorite child, but this was really good and we killed the food analogies, everything. I wasn't kidding, but truly, this was a blast. Appreciate you both flying in.

Eddie Villalba

>> Thank you so much.

Poonam Lamba

>> Thank you so much.

Eddie Villalba

>> It was an awesome experience. Thank you.

Savannah Peterson

>> Yeah. And thank all of you for tuning in to this super awesome, absolutely rad, I-bet-you're-not-watching-anything-cooler series that we have here, Google Passport to Containers. My name's Savannah Peterson. You're watching theCUBE, the leading source for enterprise tech news.