John Furrier, host of theCUBE, and Donny Greenberg, CEO of RunHouse, discussed the importance of a systems mindset in AI infrastructure. Donny, who previously worked at Meta, highlighted the challenges faced by ML teams and the inspiration behind starting RunHouse. Their platform aims to simplify distributed ML training and batch inference, bridging the gap between research and production environments. By automating code distribution and scaling, RunHouse enables ML engineers to focus on optimizing models and solving business problems. The conversation emphas...
>> Hello, welcome to theCUBE. I'm John Furrier, your host. We are wrapping down day three of our three days of wall-to-wall coverage on Wall Street. This is our East Coast studio at the NYSE. We've got Palo Alto. We've got Wall Street connecting technology and innovation across both coasts, creating a backbone of innovation and open source content and community with the NYSE Wired that Brian Baumann is leading up. We got Donny Greenberg here, the CEO of RunHouse. Great to have you on theCUBE. We were just talking before we came on. Thanks for coming on.
Donny Greenberg
>> Great to be here. Thanks for having me.>> We were talking about when you guys started the company. We know Anyscale. We know the Berkeley crowd. Well, you know them. You're a lot younger than I am. I'm old. But you're doing all the hot stuff with these guys. And again, there's so much innovation coming out right now with machine learning. If you look at the past CS programs, you go back to say six years when machine learning really took off, you saw a whole wave of engineers coming in and developers and computer scientists and starting companies, pre-Gen AI. And then Gen AI was hitting the scene. Boom, the Transformer paper comes out and then you see that wave coming. So anyone who was doing machine learning pre-Gen AI was in a sweet spot. It's like the cars before the race. They're all lined up. Who's got the pole position? If you were doing unsupervised machine learning, you were pretty much ahead of the crowd at that point. Let's talk about it.
Donny Greenberg
>> All this machine learning innovation is very long overdue. I think there was a long wave between 2018 and 2022 where companies were deriving real value from machine learning and it took a while to get there. And the infra just wasn't holding up. It was first generation. And then all my friends who were working on machine learning infrastructure in 2022 all of a sudden pivoted to be working on AI applications on top. And so actually there was a bit of a vacuum, or there has been a bit of a vacuum for the last two years. There's no reference architecture for enterprise AI infra, unfortunately. And now we're starting to see people coming back and working on it.>> It's very cool. And also the aperture of talent coming in is now that everyone sees the business model. There's no ZIRP era anymore. Zero interest rates, the funding. And by the way, the salaries on the Gen AI side are at an all-time high too. So you can see it went from crypto to Gen AI. So you got a little bit of fashion going on there, but it's a legit wave. It's happening. But I love it. I mean, I love this time. I think it's the best time to be applying computer science because there's benefits. I was talking with the JP Morgan Chase CIO, Lori Beer, last week at AWS re:Invent. She came on theCUBE and I asked her, yeah, she's got Gen AI projects. They're all, I won't say sandbox. It's my word. She didn't say that. But she said, "We're still doing hardcore machine learning," because they're regulated. They got to have lockdown. So most of their effort's still AI, but because they have to nail the reference frameworks on their resilience. And that's infra. Infra is where the game is right now. Look at AWS re:Invent, what'd they talk about? Storage, Trainium2, UltraClusters, fabrics that are highly performant networked together, Ray Summit where you spoke this past year. Again, what are they talking about? Infra.
Donny Greenberg
>> Yeah, I mean, I think that the good thing about AI as a hype cycle as compared to crypto or whatever, pick your hype cycle, is that good AI at the end of the day is about good systems thinking. It is an optimization discipline. And when you look at JP Morgan, for example, they have applications that have been banged on by engineers for the last however many years to solve fundamental business value problems at JP Morgan. And that's where the team started. They're building a fraud detection pipeline or they're building churn prediction or ranking or recommendation. Whatever it is, there's a business objective on the other side of it. And they've actually made these systems over the course of many years extremely competitive to the point that actually it should take a while for them to adopt something really novel because it has to compete on performance or cost or accuracy with existing stuff. And I actually think that a contrarian view on the current Gen AI hype cycle is that the job of an ML engineer actually hasn't changed. It's been the job, when let's say, ResNets came out on the vision side. It's been the job of an ML engineer to look at that architecture or look at the models that are available on Hugging Face and say, "How can I take this and apply it to systems that I am already working on and optimizing?" And the same is true when you suddenly get access to an LLM or you want to incorporate such and such hot new architecture. Your job hasn't changed whatsoever. The truth is, though, you still need to be intellectually honest and say if it didn't work, it didn't work. And so the velocity of work, actually, it still is the name of the game. And that I think is where the infra work is so deeply, deeply needed because the velocity of work right now just can't keep pace with what a typical engineer sees on Twitter, let's say. They want to be able to take a model and fine tune it on dozens of GPUs without thinking about it. 
That's absolutely just not the case.>> Not the case. And by the way, just separating out the hype, great point. Prototyping's fun. A little easier, actually, but what you're talking about is, I want to get something in production. I want to take something that's got AI in it and apply it to me. That's validated by Andy Jassy, now the CEO of Amazon, who came on stage at AWS re:Invent and said, and I had the quote on theCUBE, and he also said on our program, when I interviewed him, which was a rare interview, he said, "The Gen AI applications are iterating, but they're not as fast as you think." To your point, to do it fast and get a concept, okay, but to do it right is hard, which is why they had their entire show based on infra, infrastructure. Why? Because that's the problem. That's the pacing car. That's the pacing item right now.
Donny Greenberg
>> Yeah, I think that we are currently in the Hadoop era of machine learning. So before OLAP databases like Snowflake, or we're in NYC, so we should talk about Snowflake, BigQuery, Redshift.>> We can talk about Databricks.
Donny Greenberg
>> Yeah. And, of course, Databricks.>> By the way, Databricks came out of the failure of Hadoop. So the question is, if we're in the Hadoop phase, is there a Databricks coming out?
Donny Greenberg
>> I guess we're also->> That could be Ray.
Donny Greenberg
>> Yeah. So I think that in the Hadoop era, as a data analyst or engineer, you had to do a lot of infra. You had a SQL query, which was your envelope of work that you wanted to execute. And then if it fit on a single VM, then you were okay. And if it didn't fit from either a data scale or a compute scale perspective, then you're doing infra. You have a team managing Hadoop clusters and you need to conform your query to actually make it runnable, and then you need to deal with a lot of faults. And then come OLAP databases, and all of a sudden a typical data analyst just does not even do an inch of infra. And that's a revolution because it democratizes scale to thousands and thousands of non-engineer SQL-literate people inside of a company that can now do really, really massive things. And on the ML side, we just don't have that. We don't have the ability for a person to say, "I just want to think about ML. I just want to write a PyTorch training loop," or, "I just want to write a Ray program," and just throw it at the infra and not think about it. And that's a really critical missing piece to democratize ML inside of the Fortune 500 or wherever. And Databricks, I think, is also a really, really good case study for success in that area because Databricks also put the compute at the data engineer's fingertips from wherever they already worked. So in the case of, let's say, an OLAP database, your envelope is a SQL query. You're throwing the SQL query at the system. Databricks allowed you to, from inside of an Argo pipeline or inside of a notebook or wherever you happen to already work, just launch a Spark cluster in code and tap into it and just achieve this massive scale without even thinking about it. That's a beautiful thing that we need to do on the ML side.>> And that's infrastructure as code, a definition right there. 
And the point about Hadoop, just to clarify, because I think that's a great point, there was so much dependency on the infra that you lost sight of why you were doing it in the first place and that failed. And so I think, if you look at today, once you get that out of the way, you saw serverless became a good thing. Inference engines now. I mean, Amazon announced inference as a building block. That has implications to databases. You're starting to see that democratization. So where are you at in all of this? I mean, talk about your venture, what you guys are doing. Is this something that you're trying to solve? What is your core thesis? Talk a little bit about what you guys are doing.
Donny Greenberg
>> So I actually think that inference is a really good place to start. I think that inference as an infrastructure building block is extremely competitive and actually decently mature at this point. I think that the choice of both serverless and open source and SaaS offerings for doing inference at scale and having it auto-scale and auto batch and even things that are deep in the LLMs to optimize is quite rich. On the training side, that mostly doesn't exist. On the training side, if you want to take an envelope of ML code and throw it at your infra and get it to do training, tap into your platform as a runtime as the compute itself, that doesn't exist. So that's the gap that we're trying to fill. So what we do is we basically give companies a platform to give to their own engineers and researchers where all they have to do is in code take their existing Python training loop or batch inference that they want to do and then specify how they want it distributed and what it needs to run. So I need X number of GPUs. And they just execute their code normally. It essentially throws the code at the platform and magically distributes it and scales it. And then, for a typical ML engineer, you're mainly asking the question, how fast do I want this to run? Not, can I get it running at all once I get it onto the compute? Then what's really, really important, and this is where we actually spend the majority of our time, is on fault tolerance and making sure that things actually move to production quickly. So to us, the big symptom that we solve at a lot of companies is actually hitting a scale wall. It's like we want to use Ray or we want to use distributed GPU, multi-node distributed GPU, or we want TensorFlow or Megatron or whatever, or DeepSpeed, and we're hitting this wall. So by making the platform automatically distribute the code, we're also democratizing that scale in the way that an OLAP database democratizes that scale. 
That's the core thing that we do.>> You're basically doing infrastructure as code for ML?
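The workflow described above (take ordinary Python code, declare what it needs, and let the platform fan it out) can be illustrated with a small local sketch. This is not RunHouse's API: `ResourceSpec`, `dispatch`, and `partial_sum_of_squares` are hypothetical names invented for illustration, and a thread pool on one machine stands in for remote GPU workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical resource request: how many workers the job fans out to."""
    workers: int


def partial_sum_of_squares(shard: Sequence[float]) -> float:
    """Stand-in for ordinary user code that processes one slice of the data."""
    return sum(x * x for x in shard)


def split(data: Sequence[float], parts: int) -> List[Sequence[float]]:
    """Split data into `parts` contiguous shards of near-equal size."""
    size, extra = divmod(len(data), parts)
    shards, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        shards.append(data[start:end])
        start = end
    return shards


def dispatch(
    fn: Callable[[Sequence[float]], float],
    data: Sequence[float],
    spec: ResourceSpec,
) -> List[float]:
    """Run the unchanged user function on every shard, one shard per worker.

    A real platform would ship the code to remote machines; here a local
    thread pool plays that role so the sketch stays self-contained.
    """
    shards = split(data, spec.workers)
    with ThreadPoolExecutor(max_workers=spec.workers) as pool:
        return list(pool.map(fn, shards))


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    partials = dispatch(partial_sum_of_squares, data, ResourceSpec(workers=4))
    print(sum(partials))
```

The point of the shape is that the user function knows nothing about distribution; only the resource declaration changes when the job needs to run wider.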
Donny Greenberg
>> Exactly. Yeah.>> Okay. So how's that going?
Donny Greenberg
>> It's great. Honestly, I think the resurgence of a systems mindset for AI has been good for us. We are increasingly finding companies that did some experimental work with new LLM technologies, and then as a matter of trying to productionize them or incorporate them into their existing ML stack, or even just modernize their ML stack to bridge over to let's say larger models or larger compute, they're blocked and they're facing the question of, do we want to spend nine months to a year just ripping apart our infrastructure and standing up an entirely new system that there's not really a lot of public reference architecture documentation about, or do we want some help? And that's basically where we've come in and that's been a really exciting thing to see, people having an ML systems mindset trying to optimize and grow their existing systems.>> What is a systems mindset? Again, I've been saying this on theCUBE for a long time, so I have my own opinion. I won't share it. People know. I riff on this all the time. Systems thinking is I think the state-of-the-art right now because we always have these waves, design thinking, iterative. Systems thinking is very relevant. I want to get you to define it in your words. What does that mean?
Donny Greenberg
>> I think the way that I use the phrase, which absolutely might not be in line with the zeitgeist, is the->> Zeitgeist is me basically at this point.
Donny Greenberg
>> Okay, perfect. So you can tell me I'm wrong.>> No, no.
Donny Greenberg
>> Immediate feedback. To me, systems thinking is beginning with the problem, knowing exactly what you need to solve. Doing Infra for the sake of it is exactly the opposite of this. And then building up the systems of both infrastructure and code and DevOps processes to solve the problem in the way that is most efficient as a matter of the business problem to be solved, the cost and performance and accuracy, et cetera, parameters, it's a combination of the technical problem and the human problem. But it has a ruthless focus on the problem itself. And so if you were, let's say, trying to build a system for interpreting unstructured text data like interpreting PDFs. If you had a systems mindset, you would reach for the easiest thing to try to unblock your initial proof of concept, let's say, with the lightest weight underlying system. So maybe you would reach for an LLM and ask it, "Oh, please extract these terms out of my PDFs." And then when that fell flat, which it often does, you would then not reach for, let's say, fine-tuning your own LLM, but actually going to traditional NLP and using named entity recognition. That's the next lowest path of least resistance to solving your problem. And I think gradually building up in the direction and managing your system as the product is a really elegant way to scale across teams. And I think the best evidence of this is how many world leading ML practitioners today are actually ML engineers? Because they were just people who have a really, really strong systems mindset. They were pointed at an existing system, which is maybe just chugging along running a daily training or something like that, and then said, "Can you do better?" And they just need to bang on that, and they're going to do spectacular things when they're given that playground to work in.>> Yeah. And they're going to reset, refactor, look at the architecture. That's consistent. There's no real definition. I'm just riffing on it. 
But it is a concept in my mind. What you're getting at is, what am I trying to do? And building an operating system. You think about operating systems, there's consequences when you do something. You got to look at elements that are involved and say, "Okay, if I optimize for this without understanding what it could impact, I solved this, but if it doesn't address what could go wrong, then I'm coming back here again." So back to that focus. A systems mindset is a holistic view of what's going on and, again, on the problem. So I like that little addition there because it's common sense.
Donny Greenberg
>> It can't be the working definition though because it took me five minutes to say it. So we'll have to compress.>> But I think when you're a systems thinker, when you're doing system architecture, you zoom out and you got to look at, what am I doing? What am I building here? And then going into the engine, and then there's elements and subsystems. You're building around it there. There's a core problem. There's a lot of things involved. So to me, I think that's counterintuitive to outside of operating systems type thinkers. The old program, I'm going to build some code, run it, and there it does its thing. Versus the system's already given it to them. It was predated or pre-existing. It could be cloud. I'm running on top of EC2, whatever. So I think, as you start getting into generative AI, where there's a lot of other things involved, okay, I'm using this LLM, I'm going to bring this other LLM in. Or I got computer vision applications that's multi-modal. So I think the concept that we're seeing people start to think about, okay, now I got to design what's going to power it? How do I scale the infrastructure? Okay, assuming that's happening, who's designing that? Well, you've got to be a system thinker on that. Is it going to be a cluster? Is it going to be on-prem? Is it going to be in the cloud? So JP Morgan has the same problem as everyone else. Okay, it might run on-prem. Yeah, my data's here. I'm already in the cloud. Can I design the cloud to work with an on-prem system? Okay, that's an infrastructure challenge. So you zoom out. Someone's got to work on that and actually think it through and be like, "Okay, would that hang together?" And then, okay, let's try that and people do that, but not everybody. But everyone I think has to have the mindset. So if I'm a system thinker and I'm going to build the infrastructure at scale and the developers want to just have scalable infrastructure as code, then that team's got to do their job. 
Now I can solve my problem here. I think it's a scalable concept. It's just I'm seeing more successful people that I meet have that mindset and they're building something larger than what on paper it looks like. And I guess that's a long roundabout way. Again, there's no definition. I just see the pattern over and over again that the system thinking mindset is a lot like the design mindset. UX, remember the old UX wave, design thinking, user experience? What does that mean? Okay, better menus? No, user experience. Fast, reduced steps. Think about the holistic picture. And that to me fails if the data's not there. So if there's no data, that's the whole theme here. Anyway, I digress.
Donny Greenberg
>> I think there are actually a few really, really concrete phenomena that fall out of this. One of them is that, in a world where it's way easier to write code, and it depends on the code that you're writing, obviously, but let's say you're writing ML code, co-pilots and the availability of good open source examples makes it a lot faster to generate code. And then as a typical, let's say, ML practitioner in a world where things are a lot easier to do on just a POC basis, you actually are primarily blocked on either the data or the infra. So if you're in a typical company right now, actually, assuming that you have decent data governance, decent data access procedures, which is not a given, but let's say you're arriving on a team that already has some data systems in place, the data is solved. If you're doing something that is even quite sophisticated, you probably can find example code out there or get decently far writing a pretty sophisticated training loop or batch inference system or whatever with public code or with a co-pilot. So now you still have a platform problem. You have this super powerful data and super powerful code but nowhere to run it, and it's actually breaking your ability to do systems thinking because you need to now solve for so much of the layers all the way down to actually get your thing working. Whereas in a typical company where maybe now you have on-prem and multiple clouds, by far the norm for enterprise that we see is multiple clouds. This should be an existential problem to them because it should be the case that a typical engineer who starts, let's say this month at the company is able to look at some code, understand where and how it runs and begin to optimize it. 
If it's going to take them six months or a year, or if you're building something new and you grab some example code for fine-tuning an LLM or something like that, you're not going to be able to run it for a year, that actually is existential to your ability to do good ML and do good systems. And I think that the best ML thinkers that I've spoken to see ML inside of a company as a slot machine. It's like you're pulling a lever and nine out of 10 things are completely not going to work, but one of them is going to just deliver business results that you couldn't get any other way. If it takes you nine months to pull the lever, then you're not going to realize the real value. You're not going to do what Google and Facebook and Uber, et cetera did with their ML practices. You're not going to even find the magic that is ranking the Instagram Stories. Instagram Stories ranking is super important. If it can't be trivial for you to just introduce ML in all of these places, you're just not going to do it. And then your ML is just going to be executive mandates, which is not the same thing.>> Look at this, we're pedaling as fast as we can. We're not going anywhere.
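The slot-machine point above is simple arithmetic: if roughly one attempt in ten pays off, the number of payoffs per year is driven by how long each pull of the lever takes. A minimal sketch follows, where the nine-month figure and the one-in-ten rate come from the conversation, and the quarter-month (about one week) iteration time is an assumption added purely for contrast.

```python
def expected_wins(
    months_per_attempt: float,
    success_rate: float,
    horizon_months: float = 12.0,
) -> float:
    """Expected number of successful experiments within the horizon.

    Treats attempts as sequential and independent, each taking
    `months_per_attempt` and succeeding with probability `success_rate`.
    """
    if months_per_attempt <= 0:
        raise ValueError("months_per_attempt must be positive")
    attempts = horizon_months / months_per_attempt
    return attempts * success_rate


if __name__ == "__main__":
    # Nine months per attempt, one in ten works (figures from the interview).
    slow = expected_wins(months_per_attempt=9.0, success_rate=0.1)
    # Hypothetical contrast: about one week per attempt, same success rate.
    fast = expected_wins(months_per_attempt=0.25, success_rate=0.1)
    print(f"slow iteration: {slow:.2f} expected wins per year")
    print(f"fast iteration: {fast:.2f} expected wins per year")
```

Under these assumptions the slow team expects well under one win a year while the fast team expects several, which is the gap the speaker is describing.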
Donny Greenberg
>> Right.>> All right, Donny, I got to get two more questions in. One, how did you get here on your journey? How did you get to the start of the company? What led you to what you're working on now?
Donny Greenberg
>> I was the product lead for PyTorch at Meta. I worked with hundreds of ML teams inside and outside of Meta. PyTorch is a super widely used open source machine learning framework. What I saw across research through production, inside and outside of Meta, every cloud provider, most of the vendors, is that there is just a deep, deep infrastructure problem in ML where just everything is hard. There's no reference architecture. The researchers are given a sandbox to work in, which allows them to just debug and iterate, but it's too small scale that they can't really do what they're planning to do in production, and then the production systems are just impossible to debug. They're really, really faulty, they're not fault-tolerant, and it's just an extreme manual lift.>> Don't touch that button.
Donny Greenberg
>> Exactly. We don't know who trained that model or that person trained that model a .>> .
Donny Greenberg
>> We can never touch it. That fundamental problem, that just infra, there was not a singular platform as a runtime for distributed ML for training, for batch inference, for distributed specifically. Inference I think, actually, it's very competitive and it's getting better and better, but distributed is really struggling. So the platform as a runtime, just a system that your users can see as a unified surface and they can just dispatch any block of their code and scale it up and distribute it, that to us was the missing ingredient for all these teams that we were talking to. And so that's why my co-founder, who is an old friend of mine, who is a sole ML engineer at a Series D AI HR startup and stood up all their infra, Kubernetes cluster, their feature store, their serving, et cetera. So we just started working on a prototype and that's how I'm here.>> Nice. And you've seen the movie, you saw the Hadoop wave, you saw what went wrong there, large-scale clusters, saw all the warts?
Donny Greenberg
>> I led re-architectures one after another at Meta, and by the third it was like, "Okay, we should stop re-architecting.">> Scar tissue, big time. At your age, come on. You're a young gun. You've got the scar tissue. All right, final question. I have this question all the time. "Hey, I'm working in a job, I'm doing data science. I hate this company. It's so boring. State-run company. It's government," or, "I'm at a company, I'm bored. I don't know anything about ML, but I have a degree in math. I like to solve problems. How do I get into it?" There's a lot of, I won't say, fear, but indifference around, can I make it? What's your advice to folks that want to jump in and get into the game that might have the aptitude but might not even know it? Or what do they do to get going? They say, "Hey, I know I could probably do that if I just don't have the requisite skills. I've never played with PyTorch, but I've coded before. How do I get in? What do I do? Do I jump into open-source projects?" What would you advise folks? Because I see a lot of people wanting to come into the game. What should be your advice to that person?
Donny Greenberg
>> I think that, first of all, open source is almost always my default answer in this, because it's such a cheat code. You are talking to and showing your code skill to people who actually maintain the exact system that you're interested in working on at a massive company. It's such a cheat code. It's like you don't have to go through a screen. You don't have to submit a resume. You don't have to talk to a recruiter. You're just talking to the engineers.>> It's like a combine workout. How fast can you run the 40?
Donny Greenberg
>> Exactly.>> Look, that's fast.
Donny Greenberg
>> It's right in front of these engineers' eyeballs, and if you do a good job, then they will be motivated to work with you. And they understand that the people who are spending their time contributing to open source, some of them are purely enthusiasts and are doing it in their spare time. Some of them are trying to get more involved with the team. And so I think that that is a really, really good default place to start. I think another place to start that's valuable is actually helping people who are in the fold of the field of AI/ML with the things that they are beyond scale to do. Every open source project and many companies can use content help writing examples and tutorials for their own systems that just run a training to completion and show how it works and record a video or offer to write a blog post. I can't even tell you how many different ways we've thought about mobilizing various different workforces to try to just generate more tutorial or blog post content for us. It's incredibly valuable. If you want to take on a galaxy-brain AI mindset about it, it's feeding the co-pilots or something. But it's so important.>> It's like interning. It's like you come in, you ingratiate in, help out, and then you'll learn by osmosis, or mentoring or contribution, people recognizing you and giving you a task, if you don't have the coding skills, right? That's what you're saying?
Donny Greenberg
>> Exactly. I think it's specifically...>> Go get my coffee. Write some content. I mean, it sounds pedestrian, but that's how you get in.
Donny Greenberg
>> I think that public artifacts of work are ultimately what a lot of people care about when they're talking to somebody who's outside of their field that they don't know. And, actually, it's similar to open source contribution. If you see a company that you admire and you like what they do, and you offer to write a blog post or a tutorial or something like that, and it's quality, obviously, if it's junk, they're not going to publish it, you get a public artifact of your work, you're helping those people in a very direct way, and you're training yourself. What's not to love about that?>> It's a cheat code.
Donny Greenberg
>> The thing is, obviously, it's expensive. It's expensive in your own time.>> Oh, no. It's a cheat code. If you're motivated, I mean, great.
Donny Greenberg
>> Yeah.>> Donny, thanks for coming on theCUBE. I wish we had more time. We'll definitely do a deep dive next time. Thanks for coming in.
Donny Greenberg
>> Thanks for having me.>> We got crunched by the Trump factor here, crunched our schedule down big time.
Donny Greenberg
>> I feel like I'm constantly competing at venues with the President-Elect.>> We got to stay away from him. We need more time. Thanks for coming on.
Donny Greenberg
>> Thanks.>> I'm John Furrier with theCUBE, breaking it down, again. Machine learning and democratization, a big theme, open source, authenticity, trust, quality, collaboration. Again, we're doing our best here in theCUBE wrapping up day three of wall-to-wall coverage. Thanks for watching.