KubeCon + CloudNativeCon NA 2025 | Niels Bantilan & Chris Matteson, Union.ai

Niels Bantilan

Chief ML Engineer

Union.ai

Chris Matteson

Head of Sales Engineering

Union.ai

In this KubeCon + CloudNativeCon North America 2025 interview, Niels Bantilan, chief machine learning engineer, and Chris Matteson, head of sales engineering at Union.ai, join theCUBE Research’s Paul Nashawaty and host Rob Strechay to unpack how Union is tackling the emerging AI development infrastructure layer. They trace Union’s roots in Lyft’s ML platform team, explain how Flyte unifies disparate tools into durable workflows, and explore why platform engineers need a Kubernetes-like layer for AI to keep GPU-intensive pipelines reliable, automated and produ...

Rob Strechay

>> Hello and welcome back to day two of KubeCon + CloudNativeCon North America 2025 from Now-getting-warm-lanta, because it is definitely getting warm. We are definitely warmed up in here and we're really cruising through day two, having a lot of fun, bringing some people who are new to theCUBE on here. I'm really excited to have Niels Bantilan, who's the chief ML engineer at Union.ai, and Chris Matteson, who's the head of sales engineering for Union.ai as well.

>> Thank you.

Rob Strechay

>> Thanks for both coming on and also joining me. Hey, welcome to the show this week. I know you've been busy, a lot of briefings and stuff. Paul, welcome on board. Glad to have you.

Paul Nashawaty

>> Awesome to be here. Great to be here. It's been an exciting show. The show floor's buzzing. There's so much traffic. And Union is... I'm really looking forward to this presentation today.

Rob Strechay

>> Yeah, so before we go too deep in, let's understand who is Union and what you got going on, because I think you're hitting at a really interesting place in the market. Because people are looking for an easy button, and I think you guys are playing in that space, so why don't you go there?

Niels Bantilan

>> Yeah, definitely. So just a brief history lesson. Union is the company that spun out of Lyft. So there was an ML platform team at Lyft. They had all the classical ML problems around training models, deploying them to production, making sure they're reliable. And these are not the large deep learning models of today. These are XGBoost models that might fit on a CPU, but there's still operational complexity there. So this company spun out of Lyft, but while they were at Lyft, they built this tool called Flyte and it was internal at first. It gained a whole lot of traction inside the company, and the platform team decided to donate this project to open source, to the Linux Foundation. And so Union now is, I believe, around five years old, and we've been productionizing ML and AI applications and projects for quite a bit now. And we're seeing a lot of interesting changes, let's say, of late.

Rob Strechay

>> Yeah, I think it's been a hot topic. We've been talking about it this week, and I was at Infrastructure as Code, or IaCConf Connect, on Monday night doing a panel. And when I got off the stage and was talking to a lot of the platform engineering folks, their big struggle or toil was, "Okay, how do I even know how to deal with all of the toil around? I'm just trying to keep the other apps running. How do I also understand the pipelines and everything else that goes along with AI?" It seems like we're really heading towards this emergence of AI development infrastructure. Help us understand where Flyte fits in there as well.

Niels Bantilan

>> Yeah. Chris, you want to-

>> Definitely. So we've absolutely seen that same trend here: we have more and more infrastructure and more point tools trying to solve all of this together. And what we see increasingly is that it slows things down. And where this is different, as you were saying about the Infrastructure as Code and the platform people, is we've got these very expensive GPUs. We have these jobs that are not running forever, but they're often running for a very long time to do training across a lot of different machines at scale. And so I need to connect all of the various tools that I have in this, and I need to do it in a robust and durable way, where we constantly see people slowed down by the fact that something will break, and then they'll go and they'll manually fix it, and then something else will break. And you end up with these cycles where it takes you months in order to get something out. But the way the data ages, if it takes you six months to go get a model out, your model's already stale because your data's now stale. You need to do this again. What we do is we bring all of those pieces together, we tie in the various disparate tools, and allow you to build that as a workflow and a pipeline that we can then automate. And we'll take away all of the little minor challenges of, "Hey, it broke. Let's try it again," or "This API throttled us, let's try it again," or "We ran out of memory, let's give it more memory and try it again." We solve those problems for you. That increases the speed of your team, and that means that they actually get stuff out and get it done when it's still relevant, so that they're able to get to production. And that's really the change that we're seeing in our community: the people that are experimenting with Jupyter Notebooks and playing around with other tools, they can build something that looks kind of cool as demo-ware. But people that get it to production have to systematize it, and that's where Flyte and Union really come in.
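
The "try it again" behaviors described in this answer (retry on a generic failure or throttling, retry with more memory after an out-of-memory error) can be sketched in plain Python. This is only an illustration of the idea, not Flyte's or Union's API: `Throttled`, `run_with_retries`, and `flaky_training_step` are hypothetical names, and a real orchestrator would apply such policies per task rather than in user code.

```python
import time


class Throttled(Exception):
    """Hypothetical error: an upstream API rate-limited the call."""


def run_with_retries(step, memory_gb=4, max_attempts=5, base_delay=0.0):
    """Run step(memory_gb), retrying on throttling and out-of-memory errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(memory_gb)
        except Throttled:
            # Back off exponentially, then retry with the same resources.
            time.sleep(base_delay * 2 ** (attempt - 1))
        except MemoryError:
            # Give the step more memory and try again.
            memory_gb *= 2
    raise RuntimeError(f"step failed after {max_attempts} attempts")


# Usage: a fake step that is throttled once, then needs at least 16 GB.
calls = []


def flaky_training_step(memory_gb):
    calls.append(memory_gb)
    if len(calls) == 1:
        raise Throttled()
    if memory_gb < 16:
        raise MemoryError()
    return f"trained with {memory_gb} GB"


result = run_with_retries(flaky_training_step)
print(result, calls)
```

The point of the sketch is that the recovery policy lives outside the step itself, so the person writing the training code does not have to babysit each failure by hand.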

Paul Nashawaty

>> So Chris, I want to simplify this because I think it makes a lot of sense. I understand it as what Union is doing, what Union.ai is doing for AI is what Kubernetes did for cloud native, right?

>> Absolutely.

Paul Nashawaty

>> You're basically unifying that layer to abstract away all the complexity, but you're reinforcing reliability and accelerating iteration. And that is a big factor. And when you're looking at AI workloads, that's a big factor.

>> Absolutely. And let's be real, there's still lots of complexity in Kubernetes. There's all of these tools, but what Kubernetes did is bring that together in a way that I, as a platform person, could actually use it and deploy this in production and not have to think about all of these little pieces myself. And absolutely, you're right that critically we're doing the same thing here for AI. You get so many moving parts, and it's something that you could no longer handle with a script. You no longer want somebody clicking through and doing it. You want this to work consistently everywhere, every time. And that's where tools like Kubernetes and now Flyte are really critical to help solve those challenges.

Rob Strechay

>> Yeah, to me, it is about how do you solve for complexity failure... What is my failure domain? How do I understand and bring things into compliance and governance? And of course cost is in there too. So how do you see this really playing out to help people really get control on that AI complexity?

Niels Bantilan

>> Typically, if you are just starting today, and this has been true for a while now, a decade (I've been an ML engineer, practitioner for more than a decade now), you kind of start small. You do start with this point solution. So you may start off getting a little data set from your SQL database, you'll train a model, and then you'll do some in-lab work. You'll see if it even works at all. So this is your POC phase. At some point, this just doesn't work. So at some point, you have to realize, you have to look at the bigger picture holistically, hence the name. You have to look at the union of all the tools that you're going to be bringing to bear on a particular problem, be it fraud detection or whatever. And the sooner you can start optimizing costs, the better equipped you are to just hit the ground running. By disaggregating your compute, you're not spending tens of GPU hours while it's just sitting there idle while you're doing some CPU-bound data processing. As soon as you can start disaggregating the various steps in your pipeline, it's better if you have a tool that's already capable of doing that, which is what Flyte is all about.
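
To make the idle-GPU point concrete, here is a small back-of-the-envelope calculation in Python. The hourly prices, step names, and durations are all made up for illustration; only the shape of the comparison (one GPU node for the whole pipeline versus per-step resources) comes from the discussion above.

```python
from dataclasses import dataclass

# Hypothetical hourly prices, for illustration only.
CPU_NODE_PER_HOUR = 0.50
GPU_NODE_PER_HOUR = 4.00


@dataclass
class Step:
    name: str
    hours: float
    needs_gpu: bool


# A made-up three-step pipeline: only training actually uses the GPU.
pipeline = [
    Step("load_and_clean_data", hours=3.0, needs_gpu=False),
    Step("train_model", hours=2.0, needs_gpu=True),
    Step("evaluate", hours=1.0, needs_gpu=False),
]


def monolithic_cost(steps):
    """Every step runs on one GPU node, so the GPU is billed even when idle."""
    return sum(s.hours for s in steps) * GPU_NODE_PER_HOUR


def disaggregated_cost(steps):
    """Each step is billed only for the node type it actually needs."""
    return sum(
        s.hours * (GPU_NODE_PER_HOUR if s.needs_gpu else CPU_NODE_PER_HOUR)
        for s in steps
    )


mono = monolithic_cost(pipeline)
split = disaggregated_cost(pipeline)
print(f"monolithic: ${mono:.2f}, disaggregated: ${split:.2f}")
```

With these assumed numbers, four of the six GPU-billed hours in the monolithic run are spent on CPU-bound work, which is where the difference comes from.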

Paul Nashawaty

>> So when we think about Flyte, we know what it is, it unifies the platform. But we also talk about, call it the four horsemen of AI failure. Rob touched on these things: reliability, operational complexity, cost, compliance. All those things matter. As we start moving to this next generation of Flyte, because Flyte 2 is what you're talking about here at the show floor, a lot of excitement around it, your booth was really popping around it, I'd love to hear more about what that four horsemen AI failure model looks like and how you overcome that.

Niels Bantilan

>> Yeah, sure. So the first thing is basically... We could talk about compliance. There are a lot of issues around security and data governance that, with a lot of the EU regulations coming out, are going to be a big issue. So that's something that a lot of organizations need to just have established, just take for granted, really.

Paul Nashawaty

>> The EU regulations, this is coming up real fast. We're talking about at the end of this year, September 2026, we have to have reporting in place. By the end of '27, all applications need to be in compliance. And there's some steep issues, like jail time that could be... So that's very important.

Niels Bantilan

>> Yeah, so we work with a lot of high-security-need companies, let's call it, and they need things like data lineage tracking. "Here's a model I have for production, it misbehaved in production. How do I trace it back to the data set that was used to train it?" So these are the kinds of things that Flyte offers, obviously on top of other things that I can go on about, such as resource and permission isolation at every step of the workflow.
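
The lineage question quoted above ("how do I trace it back to the data set that was used to train it?") can be illustrated with a content fingerprint recorded at training time. This is a minimal sketch using only the Python standard library; the `lineage` store, the function names, and the sample data are hypothetical and do not describe how Flyte implements lineage.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 digest of a dataset (here, a list of dicts)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


lineage = {}  # model_id -> metadata about how the model was produced


def train(model_id, rows):
    """Stand-in for training: record which dataset produced the model."""
    lineage[model_id] = {"dataset_sha256": fingerprint(rows), "n_rows": len(rows)}
    return model_id


def trace(model_id, candidate_datasets):
    """Find the named dataset whose fingerprint matches the model's record."""
    wanted = lineage[model_id]["dataset_sha256"]
    for name, rows in candidate_datasets.items():
        if fingerprint(rows) == wanted:
            return name
    return None


# Usage: two nearly identical dataset versions; only one trained the model.
datasets = {
    "transactions_v1": [{"amount": 10, "fraud": 0}, {"amount": 900, "fraud": 1}],
    "transactions_v2": [{"amount": 10, "fraud": 0}, {"amount": 950, "fraud": 1}],
}
train("fraud-model-7", datasets["transactions_v2"])
source = trace("fraud-model-7", datasets)
print(source)
```

Because the fingerprint is derived from the data's content rather than its name, a model can still be matched to its training set after files are renamed or copied.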

Rob Strechay

>> So it comes with traceability, basically, and then your auditability, because we were just talking about that the other day. And I'm assuming that also, from that security perspective, it understands what those domains are as well and helps. But one of the other things that I would look at, and it seems like to me what it's doing is really helping go beyond where previous orchestration models have been going, and runbooks and other things that you're doing and codifying it, but maybe those become brittle because they have to change over time. What are you seeing as the problems that this is solving from those legacy orchestration models?

Chris Matteson

>> Tools like Airflow, the really classic ETL tools, are about solving point-wise, very static solutions of A, B, C, D, and we're done. We noticed immediately, and this is part of where Flyte came from in the first place, that that didn't really work for ML. The way things needed to be a little bit more dynamic really required a different sort of approach. But in the early days, when we wrote Flyte 1, the Python 2.7 days, there were still some limitations on what we could do. So while there were some dynamic components, it was still part of a DAG, or directed acyclic graph. But now the community's really moved on to later versions of Python. People are used to using asyncio, and we have increasingly dynamic things like AI agents, where I don't actually know what step two is until after step one, when I generate a plan. Now we're in this world where I can't actually deal with a DAG, I can't be tied to that sort of staticness. I need to be able to dynamically understand whatever it is that we need to build. And so with Flyte 2, we've gone to a pure Python approach. Anything that you can write in Python, you can now run in Flyte. And then we can get all the durability and scale that we bring to that, but now we're providing you greater speed and the ability to handle any sort of workflow you want. So think of a deep research sort of scenario, or something that might underlie stock trading. I'm going to have a question about, "What is the research I need to go do?" and I need to break that into looking at some historical stocks, looking in the news, looking at social media, and then we're going to go out to each of those and be like, "Did I gather useful information here?" Let's make a decision on that, whether we need to do more research, and build that sort of tree. You couldn't have done that in any of these legacy tools. And that's critical today to how people are going to build things for the future.
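
The deep research scenario described here, where the steps are not known until a plan is generated, can be sketched with ordinary `asyncio` code. Everything below is a hypothetical stand-in: `plan` would really call an LLM, `research` would hit real data sources, and none of this is Flyte 2's actual API. It only shows the control flow that a static DAG cannot express up front.

```python
import asyncio


async def plan(question):
    """Step one: decide which sources to consult. Not known ahead of time."""
    return ["historical_prices", "news", "social_media"]


async def research(source, depth):
    """Pretend to query a source; here 'news' needs one follow-up round."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    useful = not (source == "news" and depth == 0)
    return {"source": source, "depth": depth, "useful": useful}


async def investigate(source, max_depth=2):
    """Keep digging into one source until the result is judged useful."""
    depth = 0
    result = await research(source, depth)
    while not result["useful"] and depth < max_depth:
        depth += 1
        result = await research(source, depth)
    return result


async def deep_research(question):
    sources = await plan(question)  # the fan-out width comes from the plan
    # Run one investigation per planned source concurrently.
    return await asyncio.gather(*(investigate(s) for s in sources))


findings = asyncio.run(deep_research("Should we rebalance the portfolio?"))
for f in findings:
    print(f)
```

Both the number of branches and the number of rounds per branch are decided at run time, which is the "I don't know what step two is until after step one" situation from the answer above.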

Paul Nashawaty

>> So related to this, we're working on an economic validation with theCUBE Research where we're putting together this information and with ROI being on everybody's minds, especially when teams are trying to use AI native orchestration to optimize their cost, how does that really impact... You talk about moving away from heritage or legacy systems, moving to a more modern approach, that's very important because you're using the new tech stack. What does that mean forward as organizations are going to go... I guess what I'm asking is, how do you bridge the gap?

Chris Matteson

>> Yeah, I think there's a couple of pieces. Some of the core cost stuff is what we've always done, and it's interesting, I've seen some of these trends change on the cost side. People were unable to get Nvidia GPUs, and so they needed to make sure they were getting maximum utilization out of those that they could get. But for some organizations, money was no object. In the post-ChatGPT period, for a couple of years, they'd spend whatever money, they just had to get it. Now we're getting to the point where it's... I actually have to get ROI on these solutions. And so it becomes really critical that I make sure that I'm controlling what I'm doing in order to be as effective as possible through these steps. And again, we're able to help our customers deliver: they have visibility into the cost, better utilize the resources, and then utilize Kubernetes under the hood to help us scale up and down. So let me use the resources as effectively as I can when I can in order to drive your cost down and then let you know what that is going to be. And at the end of the day, what costs you infinity dollars is not actually delivering anything. You have no ROI if nothing gets to production. And I think that's a critical thing that we're seeing and part of this economic validation: Flyte helps people get stuff to production that otherwise never would have. And so therefore, that's the only way you're going to make any of your money back.

Rob Strechay

>> Yeah. I look at it and the ROAI is negative on a lot of things because people haven't worked backwards from the problem they're solving and how they're going about this, but we're at one of the largest open source conferences. Talk to how you see open source playing out within Union, again, knowing Flyte is open source and things like that, but talk to that and where you see the ecosystem impact really going in the future here.

Niels Bantilan

>> Yeah, it's been really fun. I joined Union fairly early on, four or five years ago, and it's great because Flyte 1 was really... It started off open source, it gained traction. I think we've mentioned it's used by 3,000-plus companies, from Stripe to Spotify to Woven Planet Toyota. It's a key strategic thing. We get contributions from all these awesome developers out there, and we've built this community. And the fun thing about Flyte 2 is we've actually kind of flipped the model a little bit. We've taken a bunch of feedback about the good things, the pain points around Flyte 1, around some of the gymnastics you might have to do to get a dynamic, agentic workflow out there. We have internalized this and now we're... Flyte 2 is really our gift back to the community based on all of that feedback. In a sense, we're not open sourcing it first. We built it in-house first and we're going to open source it by the end of the year, early next year. And it really is a confluence of all the lessons we've learned on top of the agentic AI trends that we're seeing. So the example that Chris mentioned about deep research, that's great. You can use for loops, while loops, conditionals, try/except, just all the stuff that Python offers, but it also just improves the developer experience for everyone else. If you're an ML engineer or data engineer who's been struggling, kind of, writing code, this will just make your life so much easier. And so once we open source it, we'll get the benefits again. We'll get contributions, we'll build the community again. But I'm really looking forward to when we open source it because I built the proof of concept internally. We called it Eager Workflows. It was a little clunky, but you can write Eager Workflows and you can see what it could do. Granted, it wasn't perfect at that time, but now we've steered the Union ship toward that direction. I'm just super excited to-

Paul Nashawaty

>> You're pushing the bird out of the nest.

Niels Bantilan

>> Yeah, yeah. It's kind of scary, but it should be good.

Rob Strechay

>> I love it. So when we're sitting in Salt Lake in a year from now, what do you hope that you can say then that you can't say today?

Chris Matteson

>> One thing, and we continue to be really close with our open source community, we both have great love for open source: I want to see everyone come along. This is super critical for us, and part of us releasing this is to say, "Here's the migration path for everybody to come along." We've already got all these great, cool examples with V1 where people use it at a massive scale, but I want to see them do even more stuff in V2 and be able to do it with less, because we're that much more efficient, and be able to do it that much faster. We've been seeing internal performance testing, I saw this morning, where we were doing one aspect four times faster, just the same thing. We're going to deliver all of this value to the open source community. I want to see them go take advantage of it and then build the new things. We built this for dynamic. We built this for change and for this world where I don't just have four steps, but I might have 400 steps, and I don't know what they all are. And we want to go see our community take advantage of that. So that's what I want to be talking about a year from now: look at all these cool things that the community around Flyte has taken advantage of and built.

Niels Bantilan

>> And on top of that, we're talking about open source. I don't want to commit to timelines, so I don't want to give our product team a heart attack, but I would love next year to talk about, "Here's our TypeScript SDK." We're focusing on Python because a lot of the data science and ML tooling has traditionally been around Python, but with AI engineering and AI, you mostly start off with making API calls to LLM providers. So really it's just focusing on the software and business logic. So TypeScript SDK, Go SDK, Java SDK, make it more accessible to more people because not everyone loves Python. Not everyone loves the crazy stuff you can do with Python, and they prefer other languages. And now in this AI engineering phase, it's just a lot more feasible to build this kind of software with other languages.

Rob Strechay

>> Absolutely.

Rob Strechay

>> Love it. Love it. Thank you for coming on board. This has been great. Great getting to know you guys. And now I can't wait to be talking to you, if not in Amsterdam, in Salt Lake. So thanks for coming on board.

Niels Bantilan

>> Super excited to be here every year.

Niels Bantilan

>> Thank you so much for having us, yeah.

Rob Strechay

>> Very good. And thank you as always as well.

Paul Nashawaty

>> Of course. Thank you, Rob.

Rob Strechay

>> Thank you, Paul. And thank you for watching this episode of KubeCon + CloudNativeCon North America 2025, and we're still here in warming up Atlanta. Coming back to you soon on theCUBE, the leader in tech analysis and news. See you soon.