theCUBE + NYSE Wired: Zero Trust Cyber Series | Sherry Marcus Ph.D., AWS

Sherry Marcus Ph.D.

Director, Applied Science

AWS

Dr. Sherry Marcus, Director of Applied Sciences at AWS, leads science for Bedrock, AWS's GenAI service offering various models for building GenAI applications. The focus is on providing accurate, fast, and cost-effective models through innovations like model distillation and guardrails. Customers can evaluate models using features like Model Evaluation and Knowledge Base Evaluation to choose the best fit for their needs. AWS aims to provide cost-effective options like Nova while still delivering accuracy. The company's ability to train models using its infrastructure e...

Dave Vellante

>> Hi, everybody. Welcome back to day two of our media week, NYSE Wired and theCUBE's community. This is our AI Innovators and Cyber Week. Dr. Sherry Marcus is here. She's the director of Applied Sciences at AWS. Sherry, great to see you. Thanks so much for coming in.

>> Thanks. It's great to be here. What a great spot.

Dave Vellante

>> Yeah, it's amazing, isn't it? We're just coming off of re:Invent. We got the firehose of content. You are focused on Bedrock... Tell us about your role so we can frame the conversation.

>> Yes, so I lead science for Bedrock. Bedrock, as you know, is AWS's GenAI service where customers can choose from one of many models to build out a GenAI application. My team's role is building out the components of these applications, such as guardrails, agents, RAG, distillation, and intelligent prompt routing, amongst many other things.

Dave Vellante

>> So a lot of that stuff got announced this week. I want to start with Bedrock. People think of Bedrock as, "Okay, just a place to get models. It's like a model garden," but it's much more than that. What's the science behind Bedrock? You mentioned several innovations: agents, and model distillation, which we heard this week is now part of that. Guardrails are super important. Let's get into that a little bit. Tell us about the science.

>> Yeah, so our science is really focused around three common elements that customers want. Customers want models that are highly accurate. Customers want models and applications that are fast, with very good latency. And customers want applications that are cost-effective. And so the capabilities that we build using our science try to minimize or maximize those three different functions. So for example, distillation is about taking a large model and having it teach its knowledge to a smaller model for given types of tasks. Customers are then able to use these small models and get the same performance or accuracy as the larger model, but the smaller models are faster, so they have lower latency, and they cost less.

Dave Vellante

>> Your premise as a company is that you've got to have model optionality, there will not be one model to rule them all. You're right. Customers want accuracy, they want speed, and they want it to be cost-effective. And I think I'm inferring there are trade-offs, obviously between those.

>> Yes, there are.

Dave Vellante

>> If you want the model to be perfectly accurate and the highest performance, you're probably not going to have the most cost-effective model. But then there are ways to, as you say, distill models. And so you've got to have a broad selection. In fact, we saw that with Nova. I was struck by how many options of Nova there were. It was actually somewhat overwhelming. And when you combine that with all the other models that you have, whether it's Anthropic or Mistral, et cetera, et cetera, so many choices. How do you help customers decide what's the right fit, right strategic fit for their application? Is that part of the science?

>> Yes, it is. And I'm so glad you mentioned that because we also released new features at re:Invent called Model Evaluation and Knowledge Base Evaluation. With these new features, customers can take any model on Bedrock and evaluate it for the specific features and functionality they're looking to produce. They can take, for example, 10 or 20 examples, run them against all of these models, and figure out which model has the highest accuracy or the lowest cost, and then they can start building out different types of applications.
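
Editor's sketch: the evaluation loop described above (a handful of examples run against several models, then a pick on accuracy or cost) can be illustrated roughly as follows. Everything here is an assumption for illustration: the `Candidate` type, the lambda stand-ins for model calls, and the prices are hypothetical, and this is not the Bedrock Model Evaluation API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    name: str                      # hypothetical model label
    answer: Callable[[str], str]   # stand-in for a real model call
    cost_per_call: float           # made-up price in dollars


def evaluate(candidates: List[Candidate],
             examples: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Score each candidate by exact-match accuracy and total cost."""
    report = {}
    for c in candidates:
        correct = sum(
            1 for prompt, expected in examples
            if c.answer(prompt).strip().lower() == expected.lower()
        )
        report[c.name] = {
            "accuracy": correct / len(examples),
            "cost": c.cost_per_call * len(examples),
        }
    return report


def pick_best(report: Dict[str, Dict[str, float]]) -> str:
    """Highest accuracy wins; ties go to the cheaper model."""
    return max(report, key=lambda n: (report[n]["accuracy"], -report[n]["cost"]))


# Toy evaluation set and toy "models" (lookup tables, not real LLMs).
EXAMPLES = [("capital of France?", "Paris"), ("2 + 2?", "4"), ("opposite of hot?", "cold")]
BIG_ANSWERS = {"capital of France?": "Paris", "2 + 2?": "4", "opposite of hot?": "cold"}
SMALL_ANSWERS = {"capital of France?": "Paris", "2 + 2?": "4", "opposite of hot?": "warm"}

big = Candidate("big-model", lambda p: BIG_ANSWERS[p], 0.010)
small = Candidate("small-model", lambda p: SMALL_ANSWERS[p], 0.002)

report = evaluate([big, small], EXAMPLES)
best = pick_best(report)
```

In this toy run the larger stand-in is more accurate and the smaller one is cheaper, which is the trade-off the customer has to weigh; swapping the key in `pick_best` would rank on cost first.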

Dave Vellante

>> Model distillation's interesting. If I understand your description, you take a large model and then you're able to distill it down to a smaller and smaller model, like I'm imagining a chef reducing to get the fine ingredients. I've read papers, there was a paper by OpenAI I think earlier this year that talked about smaller models being able to prompt and train larger models. I wasn't sure if that was going to... It didn't kind of intuitively make sense that a smaller... But then I thought, "Well maybe it actually can prompt better than a human." From a science standpoint, are you seeing... I guess my question is, how are the interactions between models advancing the efficacy of models?

>> So what I'm seeing is that larger models are essentially being used to train smaller models to be more performant for given tasks, and that's the trend. And the way it works, by the way, is just supervised learning. You take a set of prompts and responses that are generated by the teacher model, and then you use that to instruct the smaller model over time. Now our secret sauce in Bedrock is that we use something called data augmentation, where we're generating more prompts and responses from the teacher model to train the smaller student model, and getting better accuracy as a result. Now to your point on OpenAI, you can use very well fine-tuned smaller models to train larger models, but there has to be a business case in order to do it.
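
Editor's sketch: the data side of distillation as described here (teacher-generated prompt and response pairs, expanded through augmentation, then used as supervised training data) might look roughly like this. The `teacher` function and the rephrasing templates are hypothetical stand-ins; the actual augmentation used in Bedrock is not described in the interview.

```python
from typing import Callable, Dict, List


def teacher(prompt: str) -> str:
    """Hypothetical teacher: stands in for a call to a large model."""
    return f"ANSWER({prompt})"


def augment(prompt: str) -> List[str]:
    """Toy augmentation: keep the seed and add two templated rephrasings."""
    core = prompt.rstrip("?").strip()
    return [
        prompt,
        f"Please tell me: {core}?",
        f"Quick question, {core[0].lower() + core[1:]}?",
    ]


def build_training_set(seed_prompts: List[str],
                       teacher_fn: Callable[[str], str]) -> List[Dict[str, str]]:
    """Label every augmented prompt with the teacher's response."""
    records = []
    for seed in seed_prompts:
        for variant in augment(seed):
            records.append({"prompt": variant, "completion": teacher_fn(variant)})
    return records


seeds = ["What is the baggage allowance?", "How do I change my seat?"]
dataset = build_training_set(seeds, teacher)
# Each record is one supervised example for fine-tuning the student model.
```

Two seed prompts become six labelled examples; the student would then be fine-tuned on `dataset` with an ordinary supervised objective, a step that is outside the scope of this sketch.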

Dave Vellante

>> Yeah, it sounded like a bit of a science project, but as a scientist, it's okay to do science projects sometimes, but you're an Amazonian, so there's got to be a business case around it.

>> Exactly, exactly.

Dave Vellante

>> I get it. Can you talk scaling laws with me at all? Is that something that's in your swim lane? Can we chat about that a little bit?

>> Sure, sure.

Dave Vellante

>> Okay, so let me frame it and then you can course correct. I think many people are familiar with scaling laws: you've got to have compute, you've got to have data, you've got to have parameters, which are the weights and biases. Try to scale one without the other two and you get diminishing returns. So if you really want to get results, you've got to scale all three together. My understanding now is the world is waiting with bated breath to see what comes out of Colossus, with its hundred-thousand-GPU cluster. Will the scaling laws hold? Nobody thought it could be done, and now all of a sudden, pre-Blackwell, we're seeing some potential here. How should we think about scaling? Some people say we're running into limits. Others say, "Oh no, we're just getting started." What's the real story?

>> Well, the first observation about neural networks and scaling laws actually comes from someone in Andrew Ng's lab at Stanford, and it's how he founded Google Mind, Google Brain, which was, he noticed that the larger the neural network, the more accurate the results are. So let's scale these neural networks. Now they're large language models, bigger and bigger, to get these more fantastical results from larger and larger models. The answer is, I think there's more out there. I think with Blackwell, we're going to be able to get more accurate results from larger and larger contexts. I think there's a lot of room to grow there. I think, though, that the market has to absorb the models that we have now as well and really understand how to use them. So the message I want to send is that the science is way outpacing what industry is absorbing, as is to be expected.

Dave Vellante

>> I have another question, and again, if I'm out of your swim lane, let me know, but you've got a pretty wide observation space. Think about Nova, you think about, okay... I guess my question is, why does Amazon... It's got so much optionality, it's made big investments and commitment to Anthropic, why does the world need a Nova and why does Amazon need to be in the model building business? There's got to be some kind of business justification there. Is that something that you can address?

>> It is, and it was the number one question that I was asked at re:Invent as well. But look, customers have come to us and said, "We want models at cheaper prices. They're very expensive to run at scale." And we heard them. And so as a result, we built the Nova class, which is on average 75% cheaper than other competitor models on Amazon. And as you see from the results, they're about as accurate. So the message really is more for less on these Nova models, and it's something that historically AWS has always done: give the customer more capability at less cost.

Dave Vellante

>> Well, we had Andy Jassy on at re:Invent and we were having a side conversation with him about this and it struck me, you know how Amazon likes to be, it's comfortable being misunderstood for long periods of time. And I think the last two years Amazon's been misunderstood, but with re:Invent, it became crystal clear. You're basically taking what you did with Graviton... Graviton is to x86 as Trainium is to big NVIDIA GPUs, so you're giving lower cost options there. And based on what you said, and I think we kind of got this right, what you're doing with Nova is similar. You're taking that same mindset of, "Hey, we can develop IP internally that will lower customers' costs and still preserve their options."

>> Exactly. Exactly. It is the more for less philosophy. I would say that Graviton was built on an Arm architecture whereas the Trainium series was new, but the analogy still holds.

Dave Vellante

>> Oh, actually I didn't know that. Trainium's not Arm-based.

>> No.

Dave Vellante

>> I never knew that. And Inferentia is as well?

>> That's right. To the best of my knowledge.

Dave Vellante

>> Okay, hmm. I maybe missed that. All right, we'll check that. But it's still coming out of Annapurna, Annapurna Labs.

>> Yes, yes. Yes, yes, yes.

Dave Vellante

>> One of the greatest acquisitions in the history of tech. If you don't know Annapurna Labs, check it out. So, so much is happening in this space, I want to ask you about accuracy and guardrails. So sometimes they're counterpoised, how do you square that circle?

>> That's a great question. So let me give examples of how customers use guardrails to kind of explain the accuracy. A lot of customers use guardrails for denial of topics, meaning that they only want their customers to ask questions about what's relevant to their business. So if you're an airline, you don't necessarily want customers asking, "What's the latest news?" So we restrict topics only to specific areas. Now what comes out is that sometimes language is imprecise, topics overlap, and you may get a case where a customer asks an airline a question about the weather and the guardrail inadvertently construes that as not relevant to the topic. And so what we do is the customer has all of the transactions in their own audit in CloudWatch, and they're able to continually tune and train the guardrails so that you maximize the accuracy.
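
Editor's sketch: the airline example above (a topic restriction that wrongly blocks a weather question, then gets tuned after reviewing the audit trail) can be mimicked with a toy keyword filter. This is only an illustration under loud assumptions: the keyword set, the in-memory `audit_log`, and the `check` function are hypothetical, and a real guardrail service does not work by simple keyword matching.

```python
from typing import Dict, List, Set

# Assumed starting topic vocabulary for a hypothetical airline assistant.
ALLOWED_KEYWORDS: Set[str] = {"flight", "baggage", "seat", "booking", "delay"}

# In-memory stand-in for an audit trail of every guardrail decision.
audit_log: List[Dict[str, object]] = []


def check(question: str, allowed: Set[str] = ALLOWED_KEYWORDS) -> bool:
    """Permit a question only if it mentions an allowed topic keyword."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    permitted = bool(words & allowed)
    audit_log.append({"question": question, "permitted": permitted})
    return permitted


question = "Will the weather in Denver affect my trip?"

# First pass: a relevant question is blocked because "weather" is not listed.
first = check(question)

# After reviewing the audit log, the operator widens the allowed topics.
tuned = ALLOWED_KEYWORDS | {"weather"}
second = check(question, tuned)
```

The audit log is what makes the false positive visible; the tuning step is then a small change to the topic definition rather than a change to the model.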

Dave Vellante

>> The other interesting thing I wanted to ask you, again, this came up in the Jassy conversation, is I think it's pretty clear that you guys are in a position to train models. I think you mentioned this with Anthropic, I think you said hundreds of thousands of GPU clusters, you'll be able to train on Amazon infrastructure. So we're talking about Amazon's backend network and the associated tooling around it. And you've been working on that for a while. This again goes back to the misunderstood. So I think I'm right that you can do that without InfiniBand. You're doing it with your own Ethernet network or whatever you call your network. I forget what the brand name is, but it's your IP.

>> Correct.

Dave Vellante

>> Again, not only better margins, but lower cost for the customer as well. You can pass that on. Did I get that right?

>> You did, you did.

Dave Vellante

>> That's big.

>> It is big. It is big. It is big.

Dave Vellante

>> So I'll say it this way, because you have Trainium, and I guess Inferentia2, you were able to, over the past couple of years, get on the S-curve, the learning curve of the tooling and how to train without being a hundred percent reliant upon NVIDIA infrastructure. Others were more GPU constrained, so they had to dole out GPUs, and they probably weren't able to apply them to as many use cases. So it brings me to agents, because we see a lot about, particularly Microsoft, a lot of co-pilots, single co-pilots. You have Marc Benioff talking about Clippy, which is fun. But the point is Microsoft, and you won't talk about Microsoft, but I will, they were constrained on GPUs because they don't have the experience that you guys got starting in 2015 with Annapurna. So you are able to more widely apply the technology and learn more. So it brings me to agents, which is something I know a lot about. It seems to me there's a lot of agent washing going on. There are some real gaps in what has to happen in order to make agents deliver on the promise. One being the harmonization of the data. And I'd like to really push you on this, Sherry, because I feel like Amazon customers have data in the cloud and they have probably better data than most, but I feel like it's still disparate. It's across multiple data stores and it's still, I would think, a challenge to harmonize. How are you solving that problem internally to support your agentic frameworks and how are you helping customers solve that problem? Does that make sense?

>> It does. It does. I have a different perspective.

Dave Vellante

>> Please.

>> The secret sauce behind making agents work from a science perspective is taking the natural language, the prompts from humans, and converting that into a system instruction that the agent understands. And that translation, that prompt optimization, is very, very tough to do. We see how everybody writes prompts. Sometimes they're good, sometimes they're not. Once we're able to do that, that's really where the data harmonization you're speaking about comes in: making sure that we're accessing the right APIs for the right data to retrieve the right content to bring back to the agents and get a good result. Today we are building out lots of new connectors to lots of different data stores. This was announced at AWS re:Invent, to be able to begin to get that harmonization both internally and externally. But it really comes down to what we call tool use: being able to access these different databases and have the data retrieved by the agents, which then convert it to an answer.
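
Editor's sketch: the two steps described above (translate a free-form prompt into a structured instruction, then use that instruction to call the right tool for the right data) can be shown with a toy dispatcher. The tool names, the rule-based `to_instruction` translator, and the canned responses are hypothetical; in a real agent the translation step would be done by a model, which is the hard part referred to in the answer.

```python
from typing import Callable, Dict

# Hypothetical tools, each wrapping access to one data store.
TOOLS: Dict[str, Callable[[dict], str]] = {
    "order_status": lambda args: f"Order {args['order_id']} has shipped",
    "store_hours": lambda args: "Open 9am to 5pm",
}


def to_instruction(prompt: str) -> dict:
    """Toy stand-in for the prompt-to-instruction translation step."""
    text = prompt.lower()
    if "order" in text:
        digits = "".join(ch for ch in text if ch.isdigit())
        return {"tool": "order_status", "args": {"order_id": digits}}
    return {"tool": "store_hours", "args": {}}


def run_agent(prompt: str) -> str:
    """Translate the prompt, call the selected tool, return its result."""
    instruction = to_instruction(prompt)
    return TOOLS[instruction["tool"]](instruction["args"])


answer = run_agent("Where is my order 1234?")
fallback = run_agent("When are you open?")
```

Adding a connector to a new data store amounts to registering another entry in `TOOLS`; the quality of the whole system then depends on how reliably the translation step picks the right entry.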

Dave Vellante

>> Is this where knowledge graph comes in?

>> Yes.

Dave Vellante

>> Because you guys have a knowledge graph capability and that seems to me to be an emerging, I don't know if it's a feature or a tool or a market, it could be all three, but it seems like it's a linchpin toward that harmonization.

>> It is, because what happens is you have a RAG application and it'll retrieve different chunks of data from all these different places. What knowledge graphs do is connect those chunks based on common entities or names and then bring them back together in a coherent format. And that kind of knowledge graph helps provide more accurate answers, more complete answers, and actually more creativity in the LLM's responses.
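
Editor's sketch: connecting retrieved chunks through shared entities, as described above, can be illustrated with a small graph built over chunk IDs. The chunk IDs and entity names are made up, and entity extraction is assumed to have already happened; this is an illustration of the idea, not of any specific AWS feature.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set

# Hypothetical retrieved chunks, each tagged with the entities it mentions.
CHUNKS: Dict[str, Set[str]] = {
    "c1": {"Acme Corp", "Q3 report"},
    "c2": {"Acme Corp", "Jane Doe"},
    "c3": {"Jane Doe"},
    "c4": {"Unrelated Inc"},
}


def build_graph(chunks: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Link two chunks whenever they mention at least one common entity."""
    by_entity = defaultdict(set)
    for chunk_id, entities in chunks.items():
        for entity in entities:
            by_entity[entity].add(chunk_id)
    graph = defaultdict(set)
    for ids in by_entity.values():
        for a, b in combinations(sorted(ids), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def expand(seed: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Collect every chunk reachable from the seed through shared entities."""
    seen, stack = set(), [seed]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


graph = build_graph(CHUNKS)
related = expand("c1", graph)
```

Starting from `c1`, the expansion pulls in `c3` even though the two share no entity directly, because `c2` bridges them; that bridging is what makes the combined context more complete than the single retrieved chunk.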

Dave Vellante

>> So how do you think about the promise of agents? It seems like we are entering a new era of potentially productivity, there's obviously concerns about jobs and so forth, but productivity and growth could go through the roof as we enter this AI era. What is the potential of agentic AI?

>> I think we are just getting started and have yet to build out and realize its full capability from a science perspective, and industry is just getting started as well. The applications I've seen in agentic systems have definitely been in the call center domains, where it has increased containment, meaning that more generative AIs are able to accurately answer questions. But that said, there are more calls continually coming. I've also seen a lot of success in code development, with agents coming back with different suggestions for feature development or code transformation. And those seem to me to be the lowest hanging fruit from what I've seen in the market.

Dave Vellante

>> Interesting. I mean, you mentioned Connect. Again, I'm bringing up Jassy, sounds like I'm name dropping, which I am shamelessly, but I learned so much. Andy's amazing.

>> Yeah, he is.

Dave Vellante

>> You have a half hour conversation with a guy and you're like, "Wow, I just learned 20 things." But my point is Amazon's applying a lot of its own, I call it dogfooding, you probably call it drinking your own champagne or something, but applying its technology internally-

>> We do.

Dave Vellante

>> at a massive, massive, massive company that you can then, like you did with Connect, point externally. This is where I think agents have such a tremendous opportunity, and Amazon's in a position to really educate and facilitate that adoption. I mean, there's got to be hundreds if not thousands of use cases internally, many of which could turn into either services or software products that you can point at customers.

>> A hundred percent. In fact, we've done some of that with some of our intelligent prompt routing that we released at re:Invent. So we have our internal tech knowledge bases where techs can ask questions about what services should I use for what? And using prompt routing, a lot of queries are just very simple and we found statistically that 40% of such queries are simple and you can route those queries to a small model and gain cost efficiency. And so we're beginning to look into that right now at Amazon ourselves.
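
Editor's sketch: the routing idea above (send simple queries to a small model, the rest to a large one) and the cost effect of a 40% simple share can be worked through as follows. The word-count heuristic, the model names, and the relative prices (1 unit for the small model, 10 for the large) are assumptions for illustration only; the interview does not say how the router decides what counts as simple.

```python
def is_simple(query: str, max_words: int = 8) -> bool:
    """Toy heuristic: short, single-question queries count as simple."""
    return len(query.split()) <= max_words and query.count("?") <= 1


def route(query: str) -> str:
    """Send simple queries to the small model, everything else to the large one."""
    return "small-model" if is_simple(query) else "large-model"


def blended_cost(simple_share: float, small_cost: float, large_cost: float) -> float:
    """Average cost per query when a share of traffic goes to the small model."""
    return simple_share * small_cost + (1 - simple_share) * large_cost


short_route = route("Which service stores objects?")
long_route = route(
    "Which services should I combine to ingest streaming data, transform it, "
    "and serve it to a dashboard with low latency?"
)

# 40% simple share from the interview; the prices are made-up relative units.
cost_with_routing = blended_cost(0.40, small_cost=1.0, large_cost=10.0)
cost_without_routing = 10.0
saving = 1 - cost_with_routing / cost_without_routing
```

Under these assumed prices the blended cost is 0.4 × 1 + 0.6 × 10 = 6.4 units per query, a 36% saving over sending everything to the large model; the actual saving depends entirely on the real price gap between the two models.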

Dave Vellante

>> Sherry, I've gone way over my time. Thank you so much-

>> Thank you.

Dave Vellante

>> for coming here and appreciate all-

>> My pleasure.

Dave Vellante

>> your insights. Love to have you back.

>> I'd love to.

Dave Vellante

>> You're local. We're down here a lot, so thank you.

>> My pleasure, thank you.

Dave Vellante

>> You're very welcome. Okay, this wraps up day two. This is Dave Vellante for John Furrier, NYSE Wired and theCUBE's Media Week, AI and Cyber Innovators. We'll be back tomorrow with more great content. Thanks for watching everybody. We'll see you shortly.

>> Thank you.