Russ d'Sa, LiveKit
This conversation explores voice and multimodal artificial intelligence (AI) infrastructure for agentic applications and the technical and product factors to consider when building conversational agents that can perceive and act. The discussion highlights open-source streaming infrastructure, real-world enterprise deployments and safety approaches for non-deterministic interactions.
Russ d'Sa is the chief executive officer and co-founder of LiveKit. D'Sa describes LiveKit’s open-source streaming platform and how it enables multimodal AI agents that can see, hear and speak. Gemma Allen of theCUBE hosts, and theCUBE Research produces the interview for NYSE Wired at the AI Agent Conference 2026. D'Sa traces LiveKit’s origins during the COVID-19 pandemic and explains the company’s role in scaling ChatGPT Voice, its rapid commercial growth, and the technical challenges of building agentic applications.
Key takeaways include D'Sa’s emphasis on simulation-based testing to ensure safety and reliability for conversations driven by large language model (LLM) systems. D'Sa highlights a market bifurcation between novel consumer assistants and goal-oriented enterprise voice AI, and argues that platform tooling and the entire development lifecycle (development, testing, deployment, scaling and observation) must be redesigned to make voice AI applications as easy to build as web applications.
Relevant topics covered: voice AI, multimodal AI, streaming platform, open-source infrastructure, agentic applications, AI infrastructure, simulation-based testing, safety, enterprise voice AI, ChatGPT Voice, large language model systems.
>> Welcome back to theCUBE Studio here at the New York Stock Exchange. This is NYSE Wired, one of our programs around this year with the AI Agent Conference happening in New York next week. And joining me now is one of the folks who's been honored on the AI Agent 100 list. Russ d'Sa, CEO and co-founder of LiveKit. Welcome, Russ.
Russ d'Sa
>> Thanks so much, Gemma. I appreciate it.
Gemma Allen
>> So for those not familiar with LiveKit, maybe talk to me a little bit about this company and what the specific problem is that you guys solve.
Russ d'Sa
>> Yeah. So LiveKit is open source infrastructure that helps you build agents that can do something they've never been able to do before and that is they can see the world, they can hear and they can speak so that you can interact with an AI agent the way that you interact with another person, a human. We started off in a very different world during the pandemic as audio and video streaming infrastructure. When the pandemic hit and you were stuck at home and you couldn't leave your house and you were connecting with people over the internet, it turns out that the internet wasn't actually designed for this purpose. HTTP, everyone's typed it into a browser. It stands for hypertext transfer protocol. And so it was designed for transferring text between computers, not for transferring voice or video data that we needed it for during the pandemic. And so we started an open source project to make it easier for developers to build applications that could stream audio and video. We spent about two years working on a commercial product after our open source launch. And then we launched that commercial product at the end of 2022. End of 2022, ChatGPT, this other amazing piece of technology, comes out, and we thought it would be really cool to build a demo where instead of texting with a computer like you were with ChatGPT, maybe you could talk to that computer if you used LiveKit's infrastructure. And so we built this demo, the first demo in the world where you could have a multi-turn, back-and-forth conversation with an AI model in real time. We tweeted the demo out in April of 2023 and we were for sure thinking we were going to go viral. It only got 100 likes, so we were pretty upset. But then five months later, OpenAI found that demo and read the blog post about how we built it. And they actually built ChatGPT Voice on top of our commercial product LiveKit Cloud. And so now I guess in an indirect way, kind of under the hood, hundreds of millions of people around the world have actually used LiveKit because of ChatGPT Voice mode.
Gemma Allen
>> Wow. What a fascinating story. First of all, the fact that you started this in 2020 when the world knew very little of ChatGPT and the future and just how close generative AI was. And then the idea of Sam Altman or somebody from the OpenAI team just stumbling across this, right? It just goes to show this world of-
Russ d'Sa
>> Well, I think ... Totally. I think it's interesting because when we started during the pandemic, we did not anticipate that we were going to be talking to AI. It was still pretty early for even LLM technology back then. And we started off as network infrastructure, but what happened was once OpenAI built this initial application where you could talk to the GPT model in real time using LiveKit, we realized that there's this future we're moving towards where as the computer gets smarter and smarter and smarter and more human-like, the way that you interact with that computer also becomes more human-like. And so humans interact with eyes, ears, and mouth, and the equivalent sensors for a computer match: the eyes are cameras, the ears are microphones, and the mouth is a speaker. And so the same infrastructure that we worked on to connect a human to another human during the pandemic can actually be used to connect a human to a computer in this world of multimodal AI.
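To make the eyes/ears/mouth mapping d'Sa describes concrete, here is a minimal sketch of one conversational turn in a voice agent: microphone audio goes through speech-to-text, the text goes to an LLM, and the reply is synthesized back into audio for the speaker. The interfaces and names below are hypothetical stand-ins, not LiveKit's SDK; a production system streams every stage incrementally over the network rather than processing whole turns at once, which is the latency problem the streaming infrastructure is meant to solve.

```python
# Illustrative sketch only: hypothetical interfaces, not LiveKit's actual API.
# Shows the shape of a real-time voice-agent turn:
# microphone audio -> speech-to-text -> LLM -> text-to-speech -> speaker.
from dataclasses import dataclass, field
from typing import Iterable, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio_frames: Iterable[bytes]) -> str: ...

class LanguageModel(Protocol):
    def respond(self, history: list, user_text: str) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> Iterable[bytes]: ...


@dataclass
class VoiceAgent:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech
    history: list = field(default_factory=list)

    def handle_turn(self, audio_frames: Iterable[bytes]) -> Iterable[bytes]:
        """One conversational turn: hear, think, speak."""
        user_text = self.stt.transcribe(audio_frames)       # "ears" (microphone)
        reply = self.llm.respond(self.history, user_text)    # "brain" (LLM)
        self.history += [{"role": "user", "content": user_text},
                         {"role": "assistant", "content": reply}]
        return self.tts.synthesize(reply)                    # "mouth" (speaker)
```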
Gemma Allen
>> So I really want to talk about the tech and the use case, but first, before we go there, tell me a little bit about the company stage. I know you mentioned just before we came on camera that you've had a lot of growth this year. Set out the stall here for LiveKit. What stage of maturity perspective and commercially too, where are you guys at?
Russ d'Sa
>> Yeah, so we've grown a lot in 2025, and I think a large part of the reason for that is when we started to work with OpenAI, this market of voice AI formed around ... Or you could call the ChatGPT Voice mode feature, it kind of created or catalyzed this entire voice AI industry that is now kind of exploding and there's thousands of companies now in the space. And so our growth has also followed that trajectory just because we, in some form, created this industry with OpenAI through this thing that we built together. And so LiveKit's growth, at that time when we started to work with OpenAI, we were just a seed series A company. I think we were still seed stage when we started to work with them. And then now today, fast forward to today, in 2025, team went from 20 to 120 people. The company has gone from 20 million in funding to 180 million in funding. And just most recently at the end of last year, we raised $100 million series C at a billion dollar valuation. And so the growth and trajectory of the company, there's probably around ... I think just starting 2025, we were in the low, maybe a low five-digit number of people on the platform, that were building on the platform, and now there's probably close to 400,000 developers building on the platform. Well over now 10% of the Fortune 500 build with LiveKit. And so the growth has really been amazing. We're doing several billions of AI voice sessions around the world every year.
Gemma Allen
>> Wow. Okay. So you mentioned obviously OpenAI are a customer, client, I'm sure a large strategic player for you. You've also grown this into enterprise. It's a direct B2B offering or it's a direct to dev offering. Talk to me a little bit about the deployment and the actual execution of this. How are you seeing the rubber meet the road from a tech perspective?
Russ d'Sa
>> What's so interesting is that voice AI is a greenfield market, meaning that it's a brand new use case. You haven't really been able to talk to the computer before like a person and have a back and forth exchange. And so because it's brand new, the infrastructure for it is fundamentally different from building a web application. What that means is that large enterprises that want to build this kind of functionality, and all the way down to small startups or individual developers that want to build these kinds of applications, they all need a solution and that solution ends up being very similar. So what's been interesting about our market penetration is that there's thousands of developers and there's also lots of enterprises, and they're all actually looking for a very similar thing. Now, it doesn't mean that there isn't some enterprise readiness or enterprise specific features around compliance and security and deployment model, whether it's on prem or whether it's a cloud service. Those things do exist, but the core infrastructure for building that voice AI application is roughly the same across these different market segments. And so there's smaller startups that use us for things like patient intake in a hospital. Assort Health is a customer of ours. I call them a small startup, but they're scaling actually very quickly. But all the way up to Salesforce, Agentforce Voice is powered by LiveKit and Salesforce is one of the biggest companies in the world. And so yeah, I think the same infrastructure kind of applies in both areas.
Gemma Allen
>> Wow, that's so fascinating. So you have multi-use case really. You have this sitting alongside CRMs and within environments and also fully embedded as well. So it's a kind of multiplay approach for you guys.
Russ d'Sa
>> Yeah. I think I would say the voice AI market actually is broken down into two broad buckets of categories. I think that there's one which are these sort of brand new novel use cases. You can think of ChatGPT Voice as an example of this where people have a super smart digital assistant in their pocket that can answer questions about almost anything. These are very brand new. They haven't existed before. And so novel use cases that some companies are building. Then there's another kind of broad category of use case, which is what I call goal-oriented voice AI. And goal-oriented voice AI are trying to get the user through a workflow and primarily via a phone call. So there are millions and millions of phone calls around the world every day where a person is calling a business or someone that works for a business. So you can think of, as an example, the healthcare industry or a bank in financial services, and there's some workflow that the user is trying to accomplish. Maybe it's a wire confirmation, or maybe it's patient intake at a hospital, or maybe it's an insurance eligibility check so that a patient can get services. And so for those use cases, traditionally, there's been a human answering the phone on the other end on the behalf of the business. And now what a lot of companies are trying to do is actually take that human and put an AI model instead, at least at the frontline, answering that initial call. And so those tend to be more around these large legacy enterprise kind of industries, so like logistics, banking, and financial services, healthcare. And so yeah, I think these are the two broad buckets of type of application that people have been building in voice AI.
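As a rough illustration of the goal-oriented category, a workflow such as an insurance eligibility check can be modeled as a sequence of stages with slots the agent must fill before it can move on. The stage names and fields below are hypothetical, not any particular customer's schema.

```python
# Hypothetical sketch of a goal-oriented call flow (insurance eligibility check).
# Stage names and required fields are illustrative only.
from typing import Optional

WORKFLOW = [
    {"stage": "verify_identity", "collect": ["member_id", "date_of_birth"]},
    {"stage": "identify_service", "collect": ["procedure_code", "provider_npi"]},
    {"stage": "check_eligibility", "collect": []},   # backend tool call, no user input
    {"stage": "confirm_and_close", "collect": []},
]

def next_prompt(collected: dict) -> Optional[str]:
    """Return what the agent still needs to ask for, or None when the goal is reached."""
    for step in WORKFLOW:
        missing = [f for f in step["collect"] if f not in collected]
        if missing:
            return f"Stage '{step['stage']}': ask caller for {', '.join(missing)}"
    return None

print(next_prompt({"member_id": "A123"}))
# -> Stage 'verify_identity': ask caller for date_of_birth
```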
Gemma Allen
>> So in that example you gave around wire transfer or banking, we also know of scenarios lately where we've seen extremely realistic imitations of Jamie Dimon's voice, et cetera, ringing a bank. We hear a lot about the risks from a control and reliability perspective of using simulated humans in these environments. Talk to me a little bit about what you're seeing, hearing, and even on the innovation and tech side, what you're building for in that space that can have certain guardrails and compliancy checks embedded. How is this actually playing out in real time?
Russ d'Sa
>> So I think that when it comes to safety and guardrails, and how do you make sure that the experience is authentic on both sides of the conversation? So there's the person that is interacting with the AI, and then there is the AI itself, and what is the LLM doing? And so I think on the LLM side, you want to make sure that it's not hallucinating, it's not breaking past its guardrails or instructions that have been set up for it. You want to make sure that it's calling the right tools, it's facilitating the workflow predictably that the company that it represents intends for it to do. And so on that side, the way that we tackle that problem is through something called simulation. And what you're doing ... So at the core of this application, this voice application, is an LLM. And LLMs are stochastic, meaning that when you give it a set of inputs, it's not always going to generate the same set of outputs every time. And so you kind of have to test an application like this differently than you would've tested like a traditional web application or traditional software because it's non-deterministic. And so what you end up doing is you test that application in a similar way that we kind of test humans. We check if a human has gone to college and if they've gotten a degree and they have a resume, and we do background checks and reference calls and stuff like that, job interviews. And so what we're trying to do is build a statistical confidence that the human is going to perform a task repeatably with precision at scale. And so you have to kind of test an AI model in a similar way. So you run these simulations where you change the input slightly. So the prompt changes or the language changes or an accent changes, and you're running thousands of these simulations against a success criteria at the end to make sure that that agent through a conversation, a simulated conversation, ends up meeting the success criteria. So that's kind of some of the safety and protocol that people go through for the AI side. On the human side, it's like, well, how do you make sure that it's not a nefarious actor deepfaking somebody else's voice and things like that? And so I think there's part of it that is still being worked on by collectively the industry, which is like how do you detect cases of fraud in real time over an audio stream or combining audio with a video recognition, vision to be like, "Okay, well, this is the right person saying something, this is actually their voice." So I think the fingerprinting exercise for their voice pattern and how do you detect these kinds of things, authenticity over audio and video streams is an area of active research that we've been talking with model companies that specialize in these kind of online fraud detection models and integrating them into our platform so that developers can leverage that. So that's one thing that we're working on with a few other folks out there. And then there's the authentication schemes of the developers that are building on top of LiveKit already. So it's like, well, let's imagine that a hospital is a customer and they're trying to allow the user to call in and talk to an AI. Well, they have to do some kind of authentication step either over the call or they need to do an authentication step beforehand through their app or whatever that already exists before that call is allowed to be made. And so that is kind of external to LiveKit and what we facilitate, and that's up to the developer to enforce a level of safety beforehand.
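A minimal sketch of the simulation approach d'Sa outlines: perturb the inputs (prompt wording, language, accent), run many simulated conversations against the agent, score each against success criteria, and summarize the pass rate with a confidence interval. The run_agent stub and scenario fields are hypothetical stand-ins for a real test harness that would synthesize caller audio and drive the agent end to end.

```python
# Sketch of simulation-based evaluation: vary the inputs, run many simulated
# conversations, and build statistical confidence that the agent meets its
# success criteria. run_agent() is a hypothetical stand-in for the agent under test.
import itertools
import math
import random

LANGUAGES = ["en", "es", "hi"]
ACCENTS = ["neutral", "southern_us", "scottish"]
PROMPTS = [
    "I need to confirm a wire transfer",
    "can u check if my MRI is covered",
    "hi, calling about my claim from last week",
]

def run_agent(prompt: str, language: str, accent: str) -> dict:
    """Stand-in for replaying a simulated caller against the voice agent."""
    return {"reached_goal": random.random() > 0.1, "turns": random.randint(3, 12)}

def success(result: dict) -> bool:
    # Example success criteria: goal reached within a bounded number of turns.
    return result["reached_goal"] and result["turns"] <= 10

def evaluate(n_per_scenario: int = 50) -> None:
    results = []
    for prompt, lang, accent in itertools.product(PROMPTS, LANGUAGES, ACCENTS):
        for _ in range(n_per_scenario):
            results.append(success(run_agent(prompt, lang, accent)))
    n, passed = len(results), sum(results)
    p = passed / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)   # approximate 95% confidence interval
    print(f"{passed}/{n} simulations passed: {p:.1%} +/- {margin:.1%}")

if __name__ == "__main__":
    evaluate()
```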
Gemma Allen
>> So it's certainly a multi-trillion dollar problem to solve from the perspective of security in this agentic world. When you think about LiveKit in the next 12 months or years ahead, where are you really honing your focus? It seems as though there is so much opportunity for this technology. It has so many use cases. Multimodal from the perspective of not just frontier LLMs, but even the whole world of SaaS, right? I know there's a customer textbox, there's an opportunity for this. How do you tie your hat to one particular space or how do you think about things, Russ? Are you going all in one vertical or are you just looking at the whole world of AI and agentic across the board and making a sweeping goal?
Russ d'Sa
>> So I talked to my team about this a little bit, and there's this video that's been shared around on social in the last few months of Jensen, the NVIDIA CEO, for those not familiar, but I think everyone's probably pretty familiar, but Jensen has this video from the early days. I think it's like the early 2000s where he's talking about how he doesn't have to change the world in a year. He's going to change it in 50 years. And the way you do that is you have this multi-phase approach to doing so. And each phase is focused and simpler and easier to digest. And when it's simpler, it's easier also for your team to execute on flawlessly. It's very hard to boil the ocean, and so you kind of carve out your vision of where you're ultimately trying to get to across a bunch of stages. And so for us, the next 12, 18 months, it's really laser focused on voice AI. I think that as these AI models improve, the way you interact with AI is going to be much more like you interact with people. And so how can we support all of those applications? I think a thing that was underappreciated by us, but I think that most people and even most developers out there don't realize, is that we started off as network infrastructure to stream the voice data back and forth between human and the AI model. But it turns out that when we started to work with OpenAI and deploy Voice mode and scale it up, we realized that the entire development life cycle for an application that you can interact with like a person, the entire development lifecycle has to change, every stage of it. You can't build the application the same way, test it the same way, deploy it the same way, run it and scale it the same way, and observe it the same way. Every part of that entire iteration loop has to be redesigned for a computer that can be interacted with like a person. And so it turns out that you can't build the application the same way, test it the same way, deploy it the same way, run and scale it, or observe it the same way. So for an application that you can interact with with a camera and a microphone, like a human being, it can see you, it can hear you, it can speak back to you, everything underneath in the way that application is built and scaled and run and observed has to change. And so what we're focused on for the next 12 to 18 months is really just building all of those pieces. What we want to do is we want to make creating a voice AI application as easy and as familiar as building a web application is for people. Today, that's something that we've been doing for 30 years. Everybody knows how to do it and how to move really fast on these web applications. How do we create a platform that makes that possible for this new type of application? And so every single piece end-to-end is what we're really focused on for the next 12 to 18 months. After that comes the next phase, which is like, well, these applications are going to run on different devices. And so how do you make sure it transcends every device? And these applications are going to run for long periods of time. We see what's happening with OpenClaw and these other kinds of agents that are autonomous. Well, how do you go from just synchronous conversations with a LiveKit agent to autonomous agents that are built on top of LiveKit as well? And then what ends up happening is you have autonomous agents and they can talk to you. 
Eventually, you go from those agents being able to just manipulate bits on the internet to suddenly like, well, if I put that agent into a humanoid robot, now that agent can actually manipulate atoms. And so it can do work in the physical world and perceive the world the way that humans do. It turns out that digital AI has about a 50% overlap with physical AI, but even for physical AI, you have to build a lot of new stuff that is specific to that kind of domain. And so I think eventually over time, what you'll see LiveKit do is evolve from agents that you're talking to, to agents that can do all kinds of things and interact with computers in a very similar way that humans can, more generally speaking, and then towards agents that can roam the physical world and manipulate atoms and do work out in the physical world, like doing your dishes or building houses, et cetera. It's going to be pretty exciting once we get to that kind of world.
Gemma Allen
>> Well, as much as I love the idea of a humanoid unloading my dishwasher or maybe even helping me raise my children, I think another thing that Jensen says is you focus on the fundamentals. And it certainly sounds as though it's exactly what you guys are doing at LiveKit. So congrats on all your success and being part of the AI Agent 100 list, and hopefully see you in New York, Russ.
Russ d'Sa
>> Thank you so much, Gemma. I really appreciate it.
Gemma Allen
>> I'm Gemma Allen, coming to you from theCUBE Studio here at the New York Stock Exchange. This is part of our program with NYSE Wired. We are talking all things AI Agent Conference happening here in New York this coming week. Thanks for watching.