Exploring Innovations at AWS re:Invent 2025 with Deepgram's Scott Stephenson
In this insightful segment from theCUBE’s live stream of AWS re:Invent 2025, John Furrier, Co-Founder and Co-CEO of SiliconANGLE Media, Inc., engages in a conversation with Scott Stephenson, CEO and co-founder of Deepgram. As the event showcases groundbreaking advancements in the fields of AI and cloud computing, Stephenson offers a deep dive into the exciting announcements and the innovative strides Deepgram is making, particularly with AWS SageMaker and bidirectional streaming.
Stephenson, a prominent figure in the AI industry, shares his expertise on voice AI technology alongside host Furrier. Throughout the discussion, the two explore Deepgram’s journey and collaborations during AWS re:Invent 2025. Significant attention goes to the introduction of bidirectional streaming in AWS SageMaker, a technological advancement anticipated to transform real-time AI interactions across sectors such as call centers and healthcare.
The conversation highlights how Deepgram is paving the way for the integration of real-time voice AI solutions. Stephenson elaborates on how this evolution is crucial for executing low-latency tasks, drawing attention to its implications for developers and businesses aiming to leverage AI for high-efficiency real-time decision-making. They also reflect on the broader landscape of voice AI, sharing Deepgram’s strategic approaches toward maintaining a lead in AI innovation, as noted by theCUBE Research and hosts.
Scott Stephenson, Deepgram
Clips from this segment:
- Introduction to AWS re:Invent 2025 and theCUBE's coverage of the event.
- Unlocking Real-Time AI: Enhancing Customer Engagement in Call Centers, Meetings, and Healthcare Through Bidirectional Streaming and Reduced Latency
- Deepgram's partnership with AWS and integration into their services.
- Role of research teams in driving innovation and frontier capabilities in AI.
- Future focus areas for Deepgram and continued partnership opportunities with AWS.
In this interview from AWS re:Invent 2025, Deepgram Co-founder and Chief Executive Officer Scott Stephenson joins theCUBE’s John Furrier to discuss the critical evolution of real-time, multimodal AI. Stephenson details Deepgram's major announcement regarding bidirectional streaming in Amazon SageMaker, a development designed to reduce latency to under 100 milliseconds for seamless human-machine interaction. The conversation highlights how context is becoming paramount in AI performance, moving beyond basic transcription to understand environmental cues.
>> Welcome back, everyone, to theCUBE's live stream of AWS re:Invent 2025. I'm John Furrier, host of theCUBE. This is day two of three days of coverage. It's our 13th year covering re:Invent. We've seen the wave from cloud computing, and now, obviously, the AI and the data, abstracting away kind of the work. And then, the new apps that are coming on, it's really been a great, great show. Key announcements around the models and then how people are building apps. Scott Stephenson, the co-founder and CEO of Deepgram, is here. They're doing some pretty cool things. Scott, thanks for coming on. Day two, we're surviving. All day yesterday. How are you holding up?
Scott Stephenson
>> Thanks so much for having me. It's been great to have some of the announcements come out with Deepgram and SageMaker with bidirectional streaming. It's also exciting to see all the multimodal announcements and everything. And I'm holding up all right. It's a lot of walking, but it's been very good.
John Furrier
>> Talk about the news that you guys had with AWS, because obviously, Matt Garman laid out the big stuff and then they had the follow-on keynotes. Really the big push, you can see, obviously AI factories aligns with the AI infrastructure wave and CapEx. But a lot of the action into end-to-end multimodal models coming together. Talk about where you guys fit into some of the news. You guys had some great alignment with AWS. Just give us a quick... the news hit.
Scott Stephenson
>> Yeah, context is key. So, there's a lot of basic functionality in models now like speech-to-text, text-to-speech, text-to-text, like a language model. But in order to get those models to work at the highest level possible, you want to include context. So, like in the context of audio, are they calling from a car? Are there children yelling in the background or whatever that tells you something about the conversation that just the speech-to-text doesn't tell you? And these are multimodal ways of thinking about things. You're using audio, you're using text at the same time. They also announced images and you can use this for video as well and all these pieces give you more context. And one of the big announcements for us is bidirectional streaming in SageMaker. And this is big because right now, most AI that people... like the workloads that people have put out there into the world are actually more like an interactive or batch mode workload. And there's more for LLMs where you load all of the context in at once and then the output streams. And for voice though, you can't wait to load all the context in at once. You need to have an active conversation like we are now. I'm responding to you, we're doing all this in 100 milliseconds or less. That means that you need to stream in and you need to stream out. And so, this is a big announcement for us that I think will be a primitive that's used for a decade plus.
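To make that concrete, here is a minimal sketch of what a bidirectional streaming client can look like from the application side: audio frames go upstream continuously while partial results stream back on the same connection, so neither side waits for the other to finish. The endpoint, message format, and framing below are illustrative assumptions, not the actual SageMaker or Deepgram wire protocol.

```python
# Illustrative sketch of a bidirectional streaming client (assumed endpoint
# and message shapes; not the actual SageMaker or Deepgram protocol).
import asyncio
import json
import websockets

STREAM_URL = "wss://example.invalid/realtime"  # placeholder endpoint
FRAME_SECONDS = 0.02                           # ~20 ms audio frames, paced like a live mic

async def send_audio(ws, audio_frames):
    """Push audio upstream as it is captured, without waiting for results."""
    for frame in audio_frames:                 # each frame is raw PCM bytes
        await ws.send(frame)
        await asyncio.sleep(FRAME_SECONDS)

async def receive_results(ws):
    """Consume incremental results downstream while audio is still flowing up."""
    async for message in ws:
        event = json.loads(message)            # assumed JSON events
        print(event.get("transcript", ""))     # partial/final text arrives continuously

async def run(audio_frames):
    async with websockets.connect(STREAM_URL) as ws:
        # Upstream and downstream run concurrently: that is the bidirectional part,
        # and it is what keeps the round trip in the ~100 millisecond range.
        await asyncio.gather(send_audio(ws, audio_frames), receive_results(ws))
```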
John Furrier
>> Yeah, I love the primitive angle. Just explain more detail around the importance of the bidirectional because I think that's worth double-clicking on because where does it impact the app? Obviously, SageMaker has been great for people who are going to go under the hood and play around and obviously the developers love Bedrock, but talk about the impact. Where is it going to be impacted in the advancement? What specifically can you share? Is it just for call centers, Connect? I mean, where does it land and then where does it extend out? What's the headroom look like?
Scott Stephenson
>> Yeah. Well, I think voice is the first real-time workload where real-time is super prevalent in that workflow, but that'll be big first. So, that means call centers, it means business meetings, it means real-time transcription in hospitals in order to better deliver care, et cetera. But I think real-time AI is just an overall much larger category. And most AI now is not real-time, but we'll probably move to... It's five years, 10 years from now, we'll be having the same conversation and we'll say, "Man, 80% of tokens are real time now." But the importance of that is that you reduce the latency. So, like how we're speaking right now, I can respond to you, you can respond to me. Everything is happening in real time. It's streaming into my brain and it's streaming out of my mouth. And if I had to wait...
John Furrier
>> Yeah, it'd be awkward and slow....
Scott Stephenson
>> and then finally respond, then it's like this doesn't work. So, that's the big thing is like, "Man, do you want to wait for all of this?" And it breaks the realism of how everything works. And it's the fastest way to churn a customer in a real time interaction is if you give them that wait.
John Furrier
>> It's interesting, Scott, because you're bringing up something that's come up all day yesterday and previous interviews I've done where the latency on real time versus, "I could wait." I mean, it came up even with some of the S3 conversations about vectors and tables would say, "Hey, okay, I'll wait. I mean, I don't mind waiting if you're going to do a task, but on certain things, I need to have specific SLA on latency." And how's that going to impact, say, the average developer? I mean, take me through how you guys engage. Do I just use it in the cloud? Are you going to market with Amazon? How do I use Deepgram?
Scott Stephenson
>> Yeah. So, we're big partners with Amazon, love working with AWS. And the big thing with, you probably hear it over and over is that customer choice is a big thing for Amazon and they really take that to heart. And so, in certain situations, Deepgram's models and underlying platform are the best for that situation. It's typically real-time, low-latency. When you need really high accuracy, then using Deepgram is probably a great choice. But there are other products that Amazon has released, like the Sonic or like Transcribe, et cetera. And in certain situations, those are going to be the right choice for the customer as well. But for us, it's usually that super high volume workload, the low-latency, you need the ability to make the model really good at your challenging acoustic environment, then Deepgram is going to work really well for that in real-time.
John Furrier
>> So, you guys optimize really on the front-end, on the voice AI piece and look downstream for Nova and other things to pick up?
Scott Stephenson
>> Yeah, we do perception, so speech-to-text, but we also do text-to-speech as well, and they're both contextualized. So, just like if you wrote down a sentence and you tried to force a text-to-speech model to say it, actually, if you don't give that text-to-speech model context, then it has to say it in a monotone, even way. If you just write the word great, you could say, "Great," or, "Great," or, "Great." They all have different contexts. They all have different meanings, right? But we offer text-to-speech at Deepgram, too. You can speed it up, you can slow it down, et cetera. So, it's both sides of the conversation.
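As a rough illustration of what "both sides of the conversation" means on the synthesis side, the sketch below sends text to a TTS endpoint along with a style hint and a speaking rate. The URL, field names, and auth header are hypothetical placeholders, not a documented Deepgram or AWS API; they only show where contextual controls like pacing would plug in.

```python
# Hypothetical context-aware TTS request. Endpoint, field names, and auth
# are placeholders for illustration, not a documented API.
import requests

def synthesize(text: str, style_hint: str, speaking_rate: float = 1.0) -> bytes:
    """Render `text` as audio, passing context so the same word can come out
    differently, and at a chosen pace (>1.0 faster, <1.0 slower)."""
    response = requests.post(
        "https://tts.example.invalid/v1/speak",           # placeholder URL
        headers={"Authorization": "Token YOUR_API_KEY"},  # placeholder auth
        json={
            "text": text,
            "style": style_hint,            # e.g. "enthusiastic support agent"
            "speaking_rate": speaking_rate,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.content                 # audio bytes; format depends on the service

# Example: the single word "Great" rendered with two different intents.
# audio_a = synthesize("Great", "genuinely excited")
# audio_b = synthesize("Great", "flat, sarcastic", speaking_rate=0.9)
```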
John Furrier
>> I mean, we're streaming these conversations to the cloud and pushing through YouTube and all these channels. I mean, we could literally pick up Deepgram and say, "Hey, optimize the transcription and then call other functions and do other things." Well, who else has done conversations like this?
Scott Stephenson
>> Yeah, it's cutting edge. If you look at the closed captions on TV, the news or whatever, they're typically like delayed 10 seconds, 30 seconds, that's all going to be gone. It's all going to be replaced.
John Furrier
>> Yeah, I always hated that. I want to ask you about context, you brought that up. One of the things that's come up with some of the S3 conversations with some of the engineers was text and, well, voice goes to text. There's also metadata that makes it bigger because words are words. I mean, but if you have context like vector embeds, it's come up with kind of, I won't say fat or more bloated, but like more content. How is that impacting some of the tech? Is that something that you guys think about in terms of when you do this text or is it not an issue?
Scott Stephenson
>> Well, there's a topological effect to it. So, basically, do you accept more metadata or not? And there are many models that don't, okay? But the world is definitely moving to this multimodal, more highly contextualized type of model. So, it might be a perception speech-to-text model that normally accepts audio, that's how people think about it, but it'll also accept CRM data. So, it knows how to spell your name, it knows where you live, it knows all this other information, so it can do the perception job even better than it did before. And I think that is a trend that is going to continue. And yeah, it's all going to be combined together. We have a neuralplex architecture where the entire idea is that context is flowing and being stored everywhere. And it's definitely going to increase the size of the data that is stored, but most companies are doing this in an intelligent way where they shrink what that context is, so the most meaning is included in that context. So, you don't want to bloat what goes into your LLM by 10X just to give it context, right? You want to pay a little bit of tax on it, 10%, 30%, something like that. So, "Hey, can you compress that down?" And that might be through vector embeddings, rather than the actual human-interpretable text or other methods.
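A small sketch of the compression idea Stephenson describes: rather than pasting thousands of tokens of CRM history into the prompt, shrink it to a compact vector (or a short retrieved summary) so the added context costs a small tax rather than a 10X blowup. The library and model below are real and commonly used; the sample CRM note and the savings framing are purely illustrative.

```python
# Sketch of shrinking verbose context into a compact representation before it
# reaches the LLM. sentence-transformers is a real library; the CRM note is
# made-up example data.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public embedding model

crm_note = (
    "Example customer record: account opened 2019, prefers email, "
    "open billing ticket, based in Austin, frequent caller from a car..."
)  # imagine thousands of tokens of history here

# One 384-dimensional vector instead of the raw text; downstream retrieval or
# multimodal models consume the vector, not the full history verbatim.
context_vector = model.encode(crm_note)
print(context_vector.shape)  # (384,)
```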
John Furrier
>> Yeah, that's awesome. Thanks for explaining that. The other thing that's come up when I interviewed Matt Garman before re:Invent, he made a quote, he said something to the effect of, "Chatbots aren't about toys. It's about the system." Talk about the conversational AI. I'll put that in quotes because it's been kind of a category, chatbots. And obviously, that's a user experience. Hey, I love talking to the AI system, voice is definitely an interface. What's the difference? How do you look at the system view? Because now with agents, you got to think about a system workflow, not just a gimmicky reasoning chatbot with a knowledge bank or a knowledge system.
Scott Stephenson
>> Yeah, and the chatbot is just one function in that bigger system and you could have multiple agents working together, one that you never talk to, but it's that agent's job to go organize your database or whatever it is, right? And so, yeah, I think it's a new world, but you can also just look at it a lot like human systems. Whenever we're trying to accomplish a goal, we have a go-to-market team, a product team, we have a research team. On the research team, there are managers, there are others, et cetera. And then, we also use tools like computers, like others. And then, we call those tools, we use those tools. It's all going to be the same. It'll actually probably be managed in a similar way because we have to form it in a way that humans understand and we understand human organizations, and so then that's how we go-
John Furrier
>> Well, I love what you guys do. Love the multimodal. I love the speech-to-text. I think it's going to be a killer app. Obviously, we've been saying that, everyone's talking about it, but share your journey. You're the co-founder. What's the stats? How big are you guys? Where are you at in the progress? What are some of the coolest things you're working on? Talk about the company a little bit.
Scott Stephenson
>> Yeah. Well, so we've been around for 10 years and we started in the cohort with OpenAI. OpenAI worked on text, we worked on audio. We were at YC with OpenAI. Sam was the CEO. He was our group partner.
John Furrier
>> Yeah, that's cool.
Scott Stephenson
>> But we focused on audio and bringing end-to-end deep learning to the audio world and then bringing scale to the B2B world. And we released our first product in 2018. And since then, it's been high growth mode, but we have a research team and we have a product team and we have a full strategic infrastructure team.
John Furrier
>> How many people on the team?
Scott Stephenson
>> It's about 150 people.
John Furrier
>> Awesome.
Scott Stephenson
>> We've raised publicly about $100 million, more news to come there. But yeah, I started out as a particle physicist. I built deep underground dark matter detectors. I was like in a James Bond lair. I'm not kidding. It was like yellow railing, the crane, people grinding, sparks flying everywhere, et cetera. And I was in a government-controlled region of China right next to where they launched their satellite facilities. We bought our liquid nitrogen from them, et cetera. It was a really interesting time. But what I thought was, "This is so cool. We're two miles underground doing this. Why isn't there a documentary crew here?"
And so, we built this little device to record audio all day, every day, 24-7. Just a Raspberry Pi with a microphone and SD card. And then, it actually uploaded to S3. So, that was my first experience with AWS, maybe 14 years ago, something like that, saying, "Hey, we need somewhere to store this, but we need it to be cloud connected. So, we'll just dump it into S3 and then we'll accrue all this data." Well, we got all this data and then we thought, "Okay, we recorded our lives, but we're not going to listen to our lives again. That doesn't make any sense."
John Furrier
>> No one wants to replay it.
Scott Stephenson
>> 24-7-
John Furrier
>> Give me highlights.
Scott Stephenson
>> Exactly. So, "How do we make a highlight reel?" That was the question, right? "How do we make a highlight reel? How do we search to find those?" And long story short there, we searched the world and we couldn't find a good enough solution. We even went to Microsoft, Google. We met with their speech team and we said, "Hey, we have some weird data, just some scientists underground. We'll give you this data, but can you give us access to your next-gen end-to-end deep learning based speech system?" And they're like, "That'll never work. We tried it over and over. End-to-end deep learning is never going to work for conversation." And we're like-
John Furrier
>> "What?"
Scott Stephenson
>> "Wha?" Yeah, but this is 2014-2015, and it was all these people who were brought up on different types of systems, not end-to-end deep learning systems.
John Furrier
>> But you basically had a corpus, an unstructured data corpus. You had no dogma, just like, "Hey, I got audio."
Scott Stephenson
>> Yeah. And we had the physicist mindset of, "No, I think this is possible." And we got a very similar system working in physics. It was basically like a speech recognition system, but with a 10-word vocabulary, and it was an end-to-end deep learning system and it totally worked. And it handled the equivalent of about 100,000 concurrent streams of voice data. So, we're like, "No, this is totally possible. I don't know what you guys are talking about." So, we couldn't get off the call fast enough with them basically and say, "We need to start a company. These people have no idea what's going on."
John Furrier
>> Yeah, exactly. So, I mean, that's an interesting time. If you could look at like say 2015, and then it was 2018 when OpenAI started scratching the surface. I mean, it wasn't really on anyone's radar. It was just a bunch of folks just jamming on things, mostly research. And then, some of us were like, "Okay, my friend just went to OpenAI." And in the Valley where I live, it was like, "Okay, there's some cool shit going on there." And it's like, "Okay." Then it's like, "Wow." So, that was an interesting time. What was it like? Because I mean, it's like a revisionist history, but it was grinding back then, and then COVID hits.
Scott Stephenson
>> Yeah. Nobody was buying, but there were a few believers. So, if you put a blog post out, lots of people would read it and say, "That's really cool." If you're a founder, you don't like when people say, "That's really cool." You want people to say, "I want to buy that. Here's my checkbook." That's what you want to hear, but you hear a lot of, "That's really cool." But then you match that up with the future market size. Now, I would say voice AI is a trillion dollar market.
John Furrier
>> Yeah, huge.
Scott Stephenson
>> And we saw that from the beginning. It's like, "Wait a minute, when you get all this stuff working, it's going to be this huge market." And so, anyway, we just had that belief. And I think the physicist mindset helps here because physicists' timescales are like, we might work on something for like 20 years, 30 years, and it might just be like, "Eh, it didn't work." And so, you're in this monk mindset where you're like, "I don't care. I'm getting paid $24K, $30K a year, and I'm just going to keep pushing the ball forward because this is so cool what we get to do." I like to think about what we get to do at Deepgram in that way. We don't have to do it, we get to do it, we get to do this. And so, anyway-
John Furrier
>> And you're doing it when no one really is paying attention and you're kind of misunderstood. And then, all of a sudden, boom, everyone wants voice AI.
Scott Stephenson
>> Yeah. And it's the 10-year overnight success where you're like, "Oh, okay, everybody needs it. They need to scale it. They need it supported across 100 different languages. They need it in their VPC. They need it adapted to a different use case," all of this. And then, they're like, "Where do I get it?" And it's like, "There's only one place. It's Deepgram."
John Furrier
>> I think I'll do a documentary on that 2018-2021 timeframe, very cool things happened and controversial things. But I think what's cool is that you guys were coming at it from a science standpoint with research. And I want to ask you about your current research team because this is a pattern I'm seeing where the successful folks aren't thinking, "Oh, I nailed product-market fit." It's always iterating. And they have research teams and it's not your classic applied research or, hey, solve the world's problems. It's actually directed, intentional research. Share your vision on how you think about the research role in Deepgram and the high velocity that's coming on. So, it must be... I mean, it's kind of intentional. Yes, it's applied, but it's not like the division sponsors some research on how to make the product better, it's more breakthrough thinking or... Share your thoughts.
Scott Stephenson
>> It is. Yeah. So, you do both at the same time where you have frontier teams that are working on things that we think that in the next two years will be the frontier. And so, now we're just trying to pull that forward in time, saying, "How do we get that in three months? How do we get that in six months?" And then, we're looking toward the future where everything is contextualized, everything is really low latency, real time. All of the agents are using other models, other tools, et cetera. But then you say, "Okay, that's the vision." But we pull back and say, "What can we move forward in time with our research team today?"
And yeah, we're the ones who brought end-to-end deep learning to the speech world. Deepgram was the first company to productize end-to-end deep learning models on GPUs. At the time, image models, any type of text auto-correct or anything was all done on CPUs. GPUs were thought of as too unreliable. And we're like, "No, no, no, guys. Come on, GPUs are the future."
>> Yeah. We have to figure out how to make it reliable. And so, actually NVIDIA was a great investor in Deepgram early on, one of our seed investors and we worked with them closely. But with that in mind, this needs to be a production-grade system. And so, we are going to be the first to roll that out into the world because audio has this real-time use case, you need to take advantage of the GPUs. And so, let's just build it. And I'm willing to wait. At the time I would have predicted seven years for the voice revolution to happen. It took about nine. So, hey, for a physicist, I'm like, "That's a pretty good error bar right there."
John Furrier
>> Yeah, that's a blink of the eye right there.
Scott Stephenson
>> Yeah, exactly. And so, we were already signed up to do it for a long-term anyway. And so, now I'm just like, "Look at what's happening. It's so great. AI is going mainstream."
John Furrier
>> It's a great story. First of all, it is a killer. Everyone knows that's the interface, you can just see all the evidence there. You connect the dots. You mentioned frontier and the term frontier model. Now, Amazon's using frontier agents. It means something to be a frontier model. Explain in your mind what frontier means because there's a lot of people slicing the salami on the definition like, "Okay, it's more a bleeding edge or it's more reasoning versus say a model." What does frontier mean?
Scott Stephenson
>> Yeah. So, I think a lot of folks will look at that too on the angle of how many parameters does it have? It's a frontier model because it's the biggest or something. There are multiple frontiers here. And so, you could say I have the biggest model and that is a frontier model. I'm pushing into that regime. That's actually an easy one to do. Anybody could set way too many parameters-
John Furrier
>> Yeah.
Scott Stephenson
>> Yeah, but really how I think about frontier models is are you bringing a new capability? So, for instance, we just released a model called the Flux model, which does the job of speech-to-text, but it also determines, is it my turn to talk or is it your turn to talk? Are people talking at this time or not? So, this is called voice-activity detection and end-of-turn detection, but that's now all fused into that one model that used to do speech-to-text. Now, it has expanded this. And this is the first model of its kind to do that. It's the first to incorporate context being injected into a perception model. And so, I think of it like when the research team says, "Hey, we need to do this thing. We're hearing all this from the world, that the world needs this." Do they go look at other papers or do they think from first principles? Because with the other papers, people have already tried it and you're like, "Okay, we're just going to implement something that they're doing," and that's more applied research. When you're doing frontier thinking, no, you're setting up all the experiments yourself and you're saying, "Okay, it might be this path, it might be that path, it might be that path."
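To illustrate why fusing voice-activity and end-of-turn detection into the speech-to-text model matters for an agent, here is a hypothetical client-side event loop. The event names and the agent object are assumptions for illustration, not Deepgram's documented Flux schema; the point is that turn-taking decisions arrive from the model itself rather than from a silence timer.

```python
# Hypothetical handler for a streaming recognizer that emits transcription,
# voice-activity, and end-of-turn events from a single fused model.
# Event names and the `agent` interface are illustrative assumptions.
def handle_event(event: dict, agent) -> None:
    kind = event.get("type")
    if kind == "speech_started":
        agent.stop_speaking()                # barge-in: the caller began talking
    elif kind == "partial_transcript":
        agent.update_context(event["text"])  # keep the working transcript current
    elif kind == "end_of_turn":
        # The model, not a fixed silence timeout, decided the speaker is done,
        # so the agent can respond without an awkward pause.
        agent.respond(event["final_text"])
```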
John Furrier
>> You plowing new ground basically and doing new things?
Scott Stephenson
>> Yep. Yep. Yep.
John Furrier
>> Capabilities is really kind of the core.
Scott Stephenson
>> Yep. You're pulling out the trees, you're burning the brush, you're flattening the land, you're plowing the field. You're doing all of it.
John Furrier
>> I think we could go another hour just riffing on this, but I want to ask you one question because in the 80s when I was playing around with ontologies in my old life CS degree, AI was theory, but ontologies were built with taxonomies and word combinations. Linguistics was a big focus. What's the words, is apple a fruit or a company kind of thing? Voice is a unique linguistic attribute versus say text. You got accents, you've got all kinds of context. What's your vision on this? Because I think this is an area that I haven't really heard a lot of people talking about the value of voice linguistics because it is language, language models. I mean, so it's not just written text. What's your vision on this? I think this is going to be an innovation area.
Scott Stephenson
>> Yeah. Even just a single word, I wish I had an audio file to play you because I'll do this when we're presenting sometimes, but even just a single person saying the same word, "Hello," they can say it in many different ways, right? But then you put them in different acoustic environments, different accents, different ages, different everything. There's so much more information packed into a single word in audio compared to written down in text. And I think what is unique about this age of AI is that you don't have to be the smart one to develop the heuristics for the ontologies, for the way of organizing it. What you do is you create the sandbox and you feed it the right data, you feed it the right underlying model architecture, and then you feed it the right curriculum and training to... Now, the models are like building their own way of thinking that way. And this is actually the unlock for end-to-end deep learning. This is the original insight. If you rely on people, like our research team or any other researcher to come up with all of the insights, actually, that's the reason that Google and Microsoft are saying, "This is impossible." They-
John Furrier
>> They tried to get the answer.
Scott Stephenson
>> Yeah. It's like I'm going to hard code these things in. It's like, no, no, no. You have to build the system so that the data teaches the model what to do, basically. And we're just in the beginning stages of that still. There's an agricultural revolution that lasted maybe 1,500 years. There's an industrial revolution that lasted maybe 250 years. There's an information revolution, which is just ending now, that lasted about 75 years. That's storing data, transferring it at the speed of light across the world, that type of thing. That's the information revolution, Google's part of that, et cetera, right? But now, we're in the intelligence revolution and it's a totally different game. And we see our customers, like Salesforce, like Decagon, like Vapi, all these companies, they have thousands of customers and all of them have different situations. And so, you need to build a system that is able to identify that and say, "Oh, in your situation, the model should see more data like this and then it will be better over here." But you want a totally automated system to do this. And I think the real-time constraint really helps here too in figuring out the differentiation because if you say, "Well, can't just a huge model do that?" And it's like probably it can, but it can't do it in 100 milliseconds. And so, you have to build a system that can do all of this in a very real-time fashion. And then, that separates the real-time companies from, like, the interactive ones.
John Furrier
>> Yeah. And that's where the bidirectional streaming comes in?
Scott Stephenson
>> Exactly.
John Furrier
>> All right. Well, I really appreciate laying down that knowledge, that was awesome. Thanks for sharing that. Just put a plug in for what you're optimizing for. What's your focus? Also, you got the primitives out. I'm sure there's more development. What's the plan?
Scott Stephenson
>> Yeah. So, we are continuing to deepen our partnership with AWS. There are so many opportunities in AWS Health with SageMaker for the primitives, but also, in the AWS customer service space and Connect. So, we have an integration with Connect where you can use Deepgram for speech-to-text, text-to-speech. And we see a lot of excitement around that because, just like the AWS and Amazon mindset of giving a customer choice, in that real time use case, Deepgram fits many use cases.
John Furrier
>> It's beautiful.
Scott Stephenson
>> And so, I'm really excited about that.
John Furrier
>> Yeah. So, building block is primitive too, it's like you got both. Well, Scott, thanks for coming on. You're certainly trailblazing. Love the story. Love what you guys are doing. Congratulations. Thanks for coming on.
Scott Stephenson
>> Yeah, it was great speaking with you and thanks for having me.
John Furrier
>> All right. Cool. We're going to get this speech-to-text more real-time. We're almost there. We're doing our best to bring the conversation here on theCUBE. Thanks for watching.