KubeCon North America featured a discussion with Savannah Peterson, Rob Strechay, and guests from CoreWeave, who highlighted the importance of the open source community in helping them scale with Kubernetes. CoreWeave, known for its massive GPU infrastructure, emphasized the need for cloud-native technologies to handle a dynamic environment. The company focuses on providing a seamless experience for customers, allowing them to concentrate on their models without worrying about infrastructure. The conversation also touched on the challenges of managing AI at scale.
>> Good morning, Open Source fam, and welcome to Salt Lake City. We are here kicking off three days of coverage at KubeCon North America. My name is Savannah Peterson. Very delighted to be joined by my favorite KubeCon co-host, Rob Strechay. Rob, good morning. It's so good to be back.
Rob Strechay
>> Morning. I mean, the mountain air, the snow in the mountains.
Savannah Peterson
>> The snow. I know.
Rob Strechay
>> We got to see some snow yesterday. I think it's really great to see the community back together again. I mean, it feels like just yesterday Paris was going on.
Savannah Peterson
>> I know.
Rob Strechay
>> Now we're all back together again.
Savannah Peterson
>> I know. I just want some baguettes and vino after this, now that you just brought that up.
Rob Strechay
>> I know, well...
Savannah Peterson
>> Thanks for planting that seed. Anyway, we have an absolute OG Cube veteran, Chen, welcome back to the show, and Peter, first time on the show, both from CoreWeave.
Peter Salanki
>> Thank you for having us.
Chen Goldberg
>> Thank you.
Savannah Peterson
>> We are very excited to have you. Such a buzzy week. We're just talking about open source community and celebrating open source community. What does the open source community mean to CoreWeave and why is this event important for y'all?
Chen Goldberg
>> First of all, we're honored to be here. The first one, opening the day. Thank you for having us.
Savannah Peterson
>> Sort of the VIP right there.
Chen Goldberg
>> I think one of the things that we love about using open source technologies, and we talked on stage about the importance of not having black-box platforms, is that from our perspective, what you see is what you get. That's one. Secondly, I think that what we are seeing is that we can really achieve that scale, and the technology stack allows us to grow and do what we need to do. Peter, what about you?
Peter Salanki
>> As a company, we would not be where we are today without Kubernetes. We started out building our cloud later than the traditional legacy hyperscalers, and we quickly decided that if we're going to beat these giants, we need to choose technologies that people are familiar with, where we don't have to force them into proprietary lock-in. We built our entire stack around Kubernetes. We're on Kubernetes and bare metal, and it helps us scale, and when customers come to us, it's a familiar interface. They don't have to learn proprietary APIs, and that really lets people hit the ground running and has allowed us to scale. We simply couldn't do it without it.
Rob Strechay
>> Again, to your keynote and I sat there and listened, it was great. I think part of what was, I think, really great to see is the scale. When you talk about scale, give some numbers. I mean you had a whole bunch of numbers on stage, but what were some of the numbers you had there?
Peter Salanki
>> We run a lot of GPUs. We have hundreds of thousands of GPUs deployed. Most of them are the Hopper generation GPUs. We have around 28 data centers live now, I think, and 10 of them under construction. We plug in a lot of servers, and to do this, we need to have the right tooling. You can't sit and do this manually or use more traditional legacy technologies like Ansible. We need to build this fluidly using cloud-native technologies, so we can assemble these systems, build them and evolve them at the same pace as the industry.
Chen Goldberg
>> If you think about what that means, cloud-native technologies allow us to deal with how dynamic the environment is, and the elasticity as well, and I think that's where we really enjoy Kubernetes. Maybe the other thing that I think is unique at our scale is that we are creating an experience for our customers that helps them deal with that scale, so actually we are doing that for them. We were just having dinner yesterday with one of our customers, and they were saying, you do so much for us, which is cool.
Savannah Peterson
>> That's a nice thing to hear-
Chen Goldberg
>> Yes....
Savannah Peterson
>> and a different type of gratitude from the community than you normally hear. It's normally haggling around price or something else, and I think this is interesting. I want to hang out on the scale conversation for a second. Kubernetes generally runs our programs and applications at scale. AI makes that a whole different ballgame. What's the difference there in terms of managing at scale, and how do you help your customers do that?
Peter Salanki
>> There's a couple different components there, tying back to what we spoke about in our keynote. First of all, when you think about traditional Kubernetes, maybe you run a cluster with, like, 32 or 100 VMs in a public cloud, and those run on CPUs. CPUs don't tend to break. You run your databases, whatever. It's a fairly static environment. You leave it, you come back a day later, things look the same. When you deal with these more complex accelerated computers, they're all interconnected. First of all, there are more things that can break, and then the scale is much bigger, so we need to deal with a dynamic, shifting environment, where we need to do constant health checking and embrace failures, because we know they're going to happen and we kind of want them to happen. When we train AI models and run AI inferencing, it's a completely different use case than running a database or a website for a bank. We don't want to make a trade-off where we're leaving 50% of the performance on the table for resilience. Then you've lost out. Then you're going to train models slower than your other AI lab competitor, whoever it might be. We push these systems to the max, which is a good thing. We're trying to squeeze everything out of them, but then things are going to break, and marrying this up with how Kubernetes was originally used has been interesting over the last couple of years. We spent a lot of time using open source technologies like Argo Workflows. How do we use that to do active health checking? How do we build our own controllers that plug into Kubernetes, using all the regular constructs, writing our own CRDs, but it's still Kubernetes, to handle failures, handle life cycle? That has been really exciting. Then we have the scale part as well, where thankfully Kubernetes has also matured a lot over the last couple of years, so running clusters with 1,000 nodes is no problem, piece of cake. Running them with 10,000 nodes, not as much of a piece of cake, but there's a proven way to get there. You just scale your things correctly, don't DDoS your API servers with bad code, and you're going to be fine.
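The controller pattern Peter describes, actively probing accelerator health and taking bad nodes out of rotation rather than waiting for them to fail a job, can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration, not CoreWeave's actual tooling; the node fields, thresholds, and function names are all invented:

```python
# Hypothetical sketch of an active GPU-node health-check reconcile loop.
# All fields and thresholds below are invented for illustration.

from dataclasses import dataclass

@dataclass
class GPUNode:
    name: str
    ecc_errors: int = 0               # uncorrectable ECC errors reported by the GPU
    nccl_bandwidth_gbps: float = 0.0  # measured interconnect bandwidth from a probe job
    cordoned: bool = False

# Invented health thresholds, for illustration only.
MAX_ECC_ERRORS = 0
MIN_NCCL_BANDWIDTH_GBPS = 100.0

def check_node(node: GPUNode) -> bool:
    """Return True if the node passes all active health checks."""
    return (node.ecc_errors <= MAX_ECC_ERRORS
            and node.nccl_bandwidth_gbps >= MIN_NCCL_BANDWIDTH_GBPS)

def reconcile(nodes: list[GPUNode]) -> list[str]:
    """Cordon unhealthy nodes so the scheduler stops placing training jobs
    on them; return the names of nodes newly taken out of rotation."""
    cordoned = []
    for node in nodes:
        if not check_node(node) and not node.cordoned:
            node.cordoned = True  # in a real controller: patch spec.unschedulable or apply a taint
            cordoned.append(node.name)
    return cordoned
```

A real controller would watch Node objects through the Kubernetes API and record results in CRD status fields rather than mutating local dataclasses, but the embrace-failure reconcile-loop shape is the same.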
Rob Strechay
>> Yeah. I think to me that is one of the key differentiators that you guys bring to the table for this and for AI, but I am sure you talk to a lot of customers, and a lot are still trying to get from proof of concept to production and at scale. How do you see that, and how are you helping customers move through that life cycle? Because a lot of organizations are still struggling with that: "We're stuck in POC mode for this, for AI."
Chen Goldberg
>> It depends. There are different segments of customers. Definitely when we're talking about some of the AI labs that are creating new models, they know how to do it, and where we help them is, like Peter said, getting the best out of the infrastructure. Get the most out of it, be the most effective and run as fast as possible. Then for some other customer use cases, they still very much appreciate, of course, us helping them focus on other things. One of our customers in the health science space is using it just to run on some open data sets and make some advances in their research, as an example.
Peter Salanki
>> We want our customers to only have to focus on their models, not on infrastructure. With legacy hyperscalers, you need to be an infrastructure expert. We try to take that away from you.
Chen Goldberg
>> Just again, another quote from a customer. I remember we were just sitting and someone said, "We got our cluster and I was able to run a job on day one" and that's their experience.
Savannah Peterson
>> That's a big deal.
Peter Salanki
>> It is a big deal today.
Savannah Peterson
>> That is a big deal. That time to value really matters.
Peter Salanki
>> It is a big deal today. If we zoom out and look at the market as a whole, we're still very early. Running training clusters, running GPUs, accelerated compute at scale, it's still ridiculously hard. We are doing well now, but we have so much innovation left to really broaden AI so every enterprise out there can work with it, can run it, where they're not scared like, oh, I need to learn what a GPU is and how it breaks. We as an industry have a lot to do there, and there's a lot of exciting innovation. Comparing the talks at KubeCon this year, last year and the year before... In 2022, there were no talks about GPUs, or there was maybe one GPU talk, and I'm like, I'm going to learn something from someone else who runs this at scale, and they had four GPUs. Seeing how this has evolved and how many people are now thinking about these problems is good, but then you look at the talks and people are solving the same problems, which is really frustrating. I think we as an industry and we as a community have to come together and align on some patterns, align on some tooling, where being able to run your jobs without it being a huge infrastructure nightmare should be the baseline, so we can focus truly on innovation.
Chen Goldberg
>> I think this is a great point. When we think about getting to production, how do we make the technology more accessible, both for the data scientists and the researchers and for the platform teams? Maybe, again, another thing that we showed in the keynote is the integration we've done bringing Slurm to Kubernetes. This is an example of a pattern for how we can bring in the tools that researchers and data scientists are used to. They don't need to know anything about Kubernetes; however, they still have the data they need, and their platform team can also manage that at scale.
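The pattern Chen describes rests on translating the resource requests researchers already express in Slurm terms into Kubernetes objects the platform team can manage. A minimal sketch of such a translation layer, with invented function and field choices (this is not how CoreWeave's actual integration works, just an illustration of the idea):

```python
# Hypothetical sketch: map Slurm-style resource requests (--nodes, --gres=gpu:N)
# onto a Kubernetes batch/v1 Job manifest, built here as a plain dict.

def slurm_to_k8s_job(job_name: str, nodes: int, gpus_per_node: int,
                     command: list[str], image: str) -> dict:
    """Build a Kubernetes Job manifest from Slurm-style resource requests."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "completions": nodes,   # one pod per requested node
            "parallelism": nodes,   # run them all at once, like srun across nodes
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": command,
                        "resources": {
                            # standard device-plugin resource name for NVIDIA GPUs
                            "limits": {"nvidia.com/gpu": gpus_per_node},
                        },
                    }],
                },
            },
        },
    }
```

The researcher keeps thinking in `--nodes` and `--gres=gpu:N`, while the platform team sees an ordinary `batch/v1` Job it can quota, monitor and schedule like anything else.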
Rob Strechay
>> Yeah, well, funny enough, the biggest laugh that you got out of it was that the number one feature was SSH. That got a good laugh out of everybody, but I think again, especially for organizations where researchers actually want to SSH in and run their models and do that, but what are some of the key, I guess you could say, security concerns, privacy concerns and regulatory issues that you have to deal with, being that you have a lot of proprietary data for these organizations, like you were talking about healthcare. I'm sure it's across all different industries.
Chen Goldberg
>> This is definitely an area where we're still very early on in the journey. A lot of the regular guarantees and security, like encryption and everything, exist the same way. I think one interesting part today is that for some of our customers, the data is really important to them. We actually offer a lot of flexibility in the deployment model. A customer can also have a single-tenant environment which is really secured from anyone else, and they can really monitor who has access and how, so that's one way. With other customers, actually, for some of them the data is not confidential or anything like that. It might even be open data sets, but then there are some other new concerns. For example, they are worried about abuse of the platform. That's one discussion we've had: for example, can someone manipulate the data? I think this is really early, but I'm pretty confident that we will hear more about it on day two, which is security-focused, and definitely at the next KubeCon.
Savannah Peterson
>> I think you're right. I think in this conversation around how early it is, you have a sense of FOMO and people trying to catch up and not miss out. We're literally at day zero of this, really, I think, at this stage, just getting to adoption. What's your advice to teams or your new customers and community when they come in the door and they're feeling overwhelmed and they're just starting to spin up at this level of scale? How do you guide them through that?
Peter Salanki
>> My advice in general is try not to boil the ocean. Getting something working is more important than it being perfect. We can fine tune all the performance, use all the cool features later, but get something working. Let's get a model training, make sure it's good and then iterate from there. Taking an iterative approach. We see some startups being very ambitious, very aggressive and then you spend too much time getting up and running, chasing that immediate perfection. That's my general recommendation to everyone. Even if you're not looking to be an AI lab, if you're an enterprise, get comfortable, learn to crawl before you walk and then you're going to run. In some cases that means maybe don't go and try to train your own model day one, use an existing one and then you go to fine tuning and there are so many great open source models now, many of them trained on CoreWeave, that you can use to kind of get your use case off the ground, so you don't need to go crazy.
Savannah Peterson
>> Not only is that sage advice in general from a cost perspective, a sanity perspective, getting an early win, it's also much more sustainable. If we have everybody building all of these independent data centers and running all these models, it's an incredible amount of compute. It's an incredible amount of power. I love that you're trying to de-complicate, or simplify rather, I guess is the right word for that, the infrastructure side of this so people can just go out and build. Chen, you mentioned earlier some of the customer use cases that you're seeing folks use. Can you give us a few other examples? You mentioned the researchers and yeah.
Peter Salanki
>> Let's see what we can talk about.
Savannah Peterson
>> Tell us all the secrets that you're allowed to.
Peter Salanki
>> Originally everything was text models, right? Then we have some image models. Image models have still been kind of a side thing, not a main use case. Now we're seeing a lot of shift to multimodal, where the same model takes video, image, voice, and you can get the different outputs from it. We have a lot of interesting use cases coming up from the financial sector as well, where they utilize GPUs to do... Can't go into too much detail, but both people who do trading type use cases and also simulations. That's also increasing, which is interesting to me because I figured that wouldn't grow at the same pace as traditional AI or Generative AI, but that's also exploding in a really rapid fashion. While the marquee headlines are AI, we're seeing similar growth in finance as we're seeing in health sciences, which I think will be really interesting, specifically around the impact on the world. That's also super early. It's proof-of-concept stage. There's some drug research that's going on, and a lot of that's going to take a longer time because we need to get FDA regulators comfortable with it, but given the impact this can have on society, that's obviously where I'm most excited.
Savannah Peterson
>> I couldn't agree with you more. I think the healthcare implication, it's going to save lives. I mean when people are-
Chen Goldberg
>> This is super exciting.
Savannah Peterson
>> Yeah. When people are afraid or in their doomer mode about AI, I think that there are so many use cases where it will fundamentally make detection, research, drug delivery and development so much easier. You know it takes $2 billion on average to get a drug approved by the FDA? We had a guest on talking about that. Anyway, wild stuff.
Peter Salanki
>> Whole different discussion.
Savannah Peterson
>> Yeah. You're just so inspiring, Peter. You're taking me down. You're taking me down the line. I'm curious. Obviously big community activation for you here, lots of conversations. What are you hoping to get out of the week?
Chen Goldberg
>> I think that, like Peter said before, it'll be very interesting, first of all, for the Kubernetes community to discuss what kind of constructs and primitives we need to create to make it more boring, because right now it's definitely the opposite of that. Even though I said on stage this is boring, it's boring because we invested a lot in tooling. Some of the things we were talking about are how the orchestration should look, what kind of signals and metrics we need to have, the scale that we can achieve. I think that will be a very interesting conversation. The second thing, I can't wait to walk the floor and see... I know the small booth area is usually the most exciting one. This is just day one, but maybe if we chat tomorrow, I will have more to say on that.
Peter Salanki
>> I agree with you. At every KubeCon that I've been to, when you walk around, you meet someone new, you meet some project you haven't spoken to, and then we get all these ideas, and then we go back and it's like, oh, we should build this or we should work with these people. I'm super excited about walking the floor as well. Also, we're growing like crazy. We need amazing talent to build the Kubernetes-native ecosystem for AI.
Chen Goldberg
>> Good job, Peter.
Savannah Peterson
>> Great plug. Loving that.
Chen Goldberg
>> I should have said that. That's good. Yes, we are looking for excellent people to join us. Yes.
Savannah Peterson
>> I love it. Shout out to the small project section over there, which are all the earlier-stage CNCF projects, for those who might not know. I'm looking in this direction; they are physically over there, and I agree with you.
Peter Salanki
>> That's what's exciting.
Savannah Peterson
>> Yeah, it is exciting. It is exciting and there's hundreds of projects too, which is very cool.
Chen Goldberg
>> Again, if we think about what KubeCon and the CNCF are all about, it is about enabling innovation.
Savannah Peterson
>> This community is just so helpful. Rob and I always talk about this; it's one of our favorite events every year, well, twice a year, because of the culture of collaboration and the desire to build transparently and do things that benefit the entire community and the tech world in general. Last question for you. I mean, Chen, you've been on the show many times; Peter, you nailed it, so we'll definitely have you back. Especially with the plug too. There's a little marketer in you. I can see it. You might be the CTO, but there's a little marketer in you. What do you hope to be able to say when we're sitting down in London or in Atlanta next year, at the next KubeCon, that you can't yet say today?
Chen Goldberg
>> That's a good question.
Peter Salanki
>> I don't know if it's going to be realized next year, but I want to live in a world where all parts of AI training, the AI pipeline, can be done purely using cloud-native tools on Kubernetes. There are still a lot of things coming from the traditional HPC world that we have to integrate with, but we really need to unify around these principles we talked about, where we have one ecosystem that we can all work on together.
Savannah Peterson
>> Love that.
Chen Goldberg
>> For sure. Plus one to Peter. Again, not sure when that will be realized.
Peter Salanki
>> There's a lot of work to do.
Chen Goldberg
>> There's a lot of work to do and this is really early on, but at some point I would love this to be boring.
Peter Salanki
>> Yeah.
Savannah Peterson
>> I love that as a goal. I like that you mentioned that.
Peter Salanki
>> Then we can focus on other exciting things.
Savannah Peterson
>> Yeah.
Rob Strechay
>> Absolutely.
Savannah Peterson
>> Well, thank you for this very not-boring interview. However, both of you were absolutely fabulous. Rob, always a treat. I'm so excited for the whole week. It's going to be absolutely wonderful, and I hope that all of you are just as excited for our three days of coverage here at KubeCon in Salt Lake City. My name's Savannah Peterson. You're watching theCUBE, the leading source for enterprise tech news.