Cloud AWS re:Invent Coverage | Andy Warfield, AWS

Andy Warfield

VP & Distinguished Engineer

AWS

Reimagining data: How AWS S3 Tables redefine analytics and scalability

Data storage is shifting from static to dynamic with Amazon Web Services Inc.’s introduction of AWS S3 Tables. This move reflects broader industry trends toward open-table formats prioritizing flexibility and interoperability.

S3 Tables enable customers to work with mutable, structured query language-like datasets, a significant change from the read-only nature of Apache Parquet files, according to Andy Warfield (pictured), vice president and distinguished engineer at AWS.

AWS’ Andy Warfield talks with theCUBE about the impact of AWS S3 Tables on data storage and advancements in scalability, metadata and disaggregation.

Dave Vellante and John Furrier are at AWS re:Invent 2024 in Las Vegas, talking with Andy Warfield, a distinguished engineer at Amazon Web Services. Andy discusses the latest updates in storage, particularly S3 Tables, which bring mutable tables to S3 and give users a more conventional SQL table experience. This opens up possibilities for developers and for application data. Andy also touches on advancements in hard drive technology, such as energy-assisted HAMR (heat-assisted magnetic recording) drives.

Dave Vellante

>> Good afternoon everybody. This is Dave Vellante with John Furrier. This is day four of our AWS re:Invent 2024 coverage here from Las Vegas. We're upstairs, going wall-to-wall. It's been unbelievable. Andy Warfield is here; he's a distinguished engineer at Amazon Web Services. Good to see you again. Thanks for coming on.

Andy Warfield

>> Good to see you guys. Thank you for having me.

Dave Vellante

>> Yeah, so as I was saying, we were talking off-camera, and I counted 30 updates in the storage world, and there are so many updates in S3. Amazing. I sat through an hour-long, rapid-fire presentation thinking, "When do we get to ask questions?" And there were five minutes left for Q&A, so I'm so psyched to have you here so we can dig in a little bit.

Andy Warfield

>> Sure.

Dave Vellante

>> But my question is of those, I don't know if it's 30, but it's close, what's your favorite?

Andy Warfield

>> Oh, geez. It's probably the S3 table stuff. That's been the-

Dave Vellante

>> Very cool.

Andy Warfield

>> big .

Dave Vellante

>> Explain that. Let's dig into that because that is such a hot topic. Open table formats are all the rage. The last couple of years, we've seen the market move to them, but so explain S3 tables.

Andy Warfield

>> Sure. Parquet data, which is used to store tabular data, has been one of the fastest-growing types of data on S3. It's become the de facto.

Dave Vellante

>> Right.

Andy Warfield

>> We do 15 million requests a minute to Parquet data. We serve 150 petabytes of Parquet a day. There's a lot of analytics happening to Parquet. About three years ago we started to hear customers talking a lot about OTFs, and in particular Apache Iceberg, and that's ramped up. And the thing that we found with Iceberg in the customer conversations is on day one it's pretty easy to turn on in Spark, pretty easy to use in tools, but as you get experience using it, it gets a little bit more burdensome to run. And so we were having a lot of customers ask us to make it easier to run Iceberg on top.

Dave Vellante

>> Yeah. I mean OTF, open table formats and customers, what's the motivation? They don't want to be locked into any environment. They want to be able to bring any engine and they want to bring any compute to any data.

Andy Warfield

>> Exactly.

Dave Vellante

>> Okay. We've separated compute from data. Great. That was the cloud. You guys actually made that happen with some of your partners, but now it's opening up even further, opening the aperture, right? And then of course the big question is how do we govern all that stuff?

Andy Warfield

>> Right.

Dave Vellante

>> I know that's not necessarily your swim lane, but that's an important consideration for customers, right?

Andy Warfield

>> Yep.

Dave Vellante

>> Part of that is metadata, which you guys also have made some announcements around.

Andy Warfield

>> Yes, totally. On the table stuff, on the Iceberg side, the distinction between traditional Parquet and the OTFs is that it takes what were basically read-only tables (you could add to them by adding whole Parquet files) and makes them mutable, right? It brings them closer to being a more conventional SQL table. That is becoming a primitive in S3, right? You'll be able to create a table bucket, we call it. Create a table inside it, and it gets its own endpoint. It's a first-class resource, which means you can set policy on the whole table, which was difficult to do before. Because we know that it's tabular data, we get a huge performance bump because we can customize storage and namespace performance to it. And it integrates with everything. The metadata side takes that table and turns it into a system table that we manage. And now as you put data into S3, just as a normal S3 customer, you could turn on metadata and we will fill a table, effectively like CDC, change data capture, with the changes into your bucket and populate a journal of all of the changes you've made to the bucket in a table. And so now you can bring SQL tools and go and analyze and look at all of your data. And what we see with a lot of the S3 workflows, gen AI of course, but also things like genomics and health data, is customers are augmenting and extending the data they have in S3. And so having the SQL surface means that they can go and add their own fields, build their own metadata, curate, tag all the data that they have in it.
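
The journal idea Andy describes, a table of every change made to a bucket that SQL tools can query, can be illustrated with a small local sketch. This uses Python's built-in `sqlite3` rather than any AWS service, and the `journal` schema, column names, and object keys are hypothetical; the conversation does not describe the real layout of the managed metadata table.

```python
import sqlite3

# Hypothetical journal schema; the real managed metadata table layout is
# not described in this conversation.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE journal (
        seq INTEGER PRIMARY KEY,   -- order in which changes were recorded
        object_key TEXT NOT NULL,  -- key of the object that changed
        op TEXT NOT NULL,          -- 'PUT' or 'DELETE'
        size_bytes INTEGER         -- object size for PUTs, NULL for DELETEs
    )
    """
)

# Made-up change events, appended in the order they happened.
changes = [
    ("genomes/sample-1.bam", "PUT", 1000),
    ("genomes/sample-2.bam", "PUT", 2500),
    ("genomes/sample-1.bam", "PUT", 1200),    # overwrite of sample-1
    ("genomes/sample-2.bam", "DELETE", None),
]
conn.executemany(
    "INSERT INTO journal (object_key, op, size_bytes) VALUES (?, ?, ?)", changes
)

# Current bucket contents = latest journal row per key, skipping keys whose
# most recent change was a DELETE.
live_objects = conn.execute(
    """
    SELECT j.object_key, j.size_bytes
    FROM journal AS j
    JOIN (SELECT object_key, MAX(seq) AS last_seq
          FROM journal GROUP BY object_key) AS latest
      ON j.object_key = latest.object_key AND j.seq = latest.last_seq
    WHERE j.op <> 'DELETE'
    ORDER BY j.object_key
    """
).fetchall()
print(live_objects)
```

The point of the sketch is only that once changes land in a table, ordinary SQL is enough to reconstruct current state or ask questions about history.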

Dave Vellante

>> Going back to what you were saying about making the table mutable, so that means a third-party client will be able to read and write to those tables?

Andy Warfield

>> Yes, absolutely.

Dave Vellante

>> In a true open format. Okay.

Andy Warfield

>> And so today with the S3 Tables launch, an Iceberg client talking directly to the table can read and write today. You can turn on Firehose; Amazon Data Firehose added support for Iceberg. You can take any Firehose source and pop its data into Iceberg tables in S3 Tables. And then you can take QuickSight, which has also added Iceberg support. You can stand up a dashboard and start pulling that stuff out.

Dave Vellante

>> Scope the value there because some people might not understand what that means. The Firehose connects into the S3 table bucket. QuickSight pulls on. Just take us through what it used to be like.

Andy Warfield

>> It was a fair bit of work to go through this stuff.

Dave Vellante

>> How many steps did it take, and now what is the outcome now? Old way, you have to cobble some stuff together. Read, but no writes, and now you've got both and now you've got Firehose. What does that mean? Connect to value.

Andy Warfield

>> It's 100% time to value, right? And I don't think we're all the way there, right? There's still a whole bunch of stuff for us to work through, but relative to working with objects and setting your own schema and working with the object data, it's still a valuable pattern. But now with Iceberg as a common ground for it, you're dealing with rows and columns as an abstraction that you work through. You can have something like Firehose putting stuff into the table. You have other clients putting stuff into the same table because Iceberg mediates that, and then you can attach whatever you want to pull stuff out.
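
Iceberg lets several writers share one table through optimistic concurrency: a writer prepares its change against the snapshot it read, and the commit only succeeds if no one else has committed in the meantime; otherwise the writer retries. A toy sketch of that pattern follows. `TinyTable`, `try_commit`, and `append_rows` are hypothetical names for illustration and are not part of any Iceberg client API.

```python
import threading


class TinyTable:
    """Toy stand-in for a table whose commits swap a snapshot pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0
        self.rows = ()

    def read(self):
        # Return a consistent (snapshot id, rows) pair.
        with self._lock:
            return self.snapshot_id, self.rows

    def try_commit(self, base_snapshot_id, new_rows):
        """Compare-and-swap: succeed only if nobody committed since base."""
        with self._lock:
            if self.snapshot_id != base_snapshot_id:
                return False
            self.rows = new_rows
            self.snapshot_id += 1
            return True


def append_rows(table, rows, max_retries=100):
    """Re-read and retry until the commit lands on an unchanged snapshot."""
    for _ in range(max_retries):
        base_id, current = table.read()
        if table.try_commit(base_id, current + tuple(rows)):
            return
    raise RuntimeError("too many conflicting commits")


table = TinyTable()
# Two kinds of writers (a stream and an application) share one table.
writers = [
    threading.Thread(target=append_rows, args=(table, [(source, i)]))
    for source in ("firehose", "app")
    for i in range(5)
]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()

final_id, final_rows = table.read()
print(final_id, len(final_rows))
```

Every one of the ten appends ends up in the table exactly once, regardless of how the threads interleave, because a conflicting commit is rejected and redone rather than silently overwriting another writer.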

Dave Vellante

>> It's fusion of just integration is seamless.

Andy Warfield

>> And it's been super exciting to see the reaction to it this week. People are pretty excited about it.

Dave Vellante

>> People who know this, nerdy. It's like yeah.

Andy Warfield

>> It is total-

Dave Vellante

>> It's like, I get it.

Andy Warfield

>> Yeah.

Dave Vellante

>> People who know storage.

Andy Warfield

>> Well, and the team all week, the team's been absolutely killing themselves to get this thing together. They've been so invested in it and launching something where you get this reaction. The team is just...

Dave Vellante

>> Well, congratulations to that team, you guys. We've been covering Snowflake, Databricks, that whole data warehouse space as it goes next level. We've been following a lot of the challenges and opportunities around what this could turn into, for the developers too. And also now you've got the citizen developer coming online with tools like Q for Business, Q for Developer.

Andy Warfield

>> Right.

Dave Vellante

>> It's only going to get easier to write apps.

Andy Warfield

>> Absolutely. Well, even QuickSight this week announced support for Q-based natural language queries to build the dashboards themselves. You can sit there in Q and just describe the dashboard that you want and start bringing up-

Dave Vellante

>> Get me sales for the month.

Andy Warfield

>> Yeah.

Dave Vellante

>> Over time, five years.

Andy Warfield

>> Yeah.

Dave Vellante

>> Listen, it's amazing when you think about S3. I mean, the original OG of the cloud, what, 2006?

Andy Warfield

>> 2006.

Dave Vellante

>> Okay. And it's like, oh wow. Object and the cloud, get put, simple, cheap, deep. Okay, great. And now it's evolved, and you've got now a high-performance object, which I think you announced last year at re:Invent. How should we think about the portfolio of services within S3?

Andy Warfield

>> When I was putting together the storage talk for this year, I was reading the 2006 PR, right? The AWS PR and there's this S3, well, it talks about S3 and it says, "S3 is storage for the internet." And just as one sentence in the thing, and that is today, still 18 years later, how the team thinks about it, right? We've always said we will go where the internet takes us in terms of storage. And so you see stuff like table buckets, but all of the things along the way that have come in there.

Dave Vellante

>> Where is the internet taking us now? Because obviously we can see the agentic wave, data having that data layer completely agile, but also extendable and integration, hassle-free, zero ETLs. Now no one even talks about that anymore. It's like, hey, it's happening.

Andy Warfield

>> I'll tell you one thing that I think has been a really interesting change, and it's motivated the tables and the metadata launches this year. For, I don't know, 10 years, my storage conversations have often been about storage, right? They're conversations about performance and scale and-

Dave Vellante

>> Cost.

Andy Warfield

>> Cost. Over the past few years, a lot of those conversations have shifted to us seeing customers build data lakes and inside organizations, pull data from different bits of the organization and build new stuff.

Dave Vellante

>> It's a data conversation.

Andy Warfield

>> It's a data conversation. And the data conversation most recently has shifted to, there's so much data. And it's never like, oh no, it's never a negative. It's like, how do I get to value fast on top of all that? How do I do discovery and how do I do understanding?

Dave Vellante

>> It's data and value.

Andy Warfield

>> We're getting pulled to assembling value on top of the data.

Dave Vellante

>> And the thing about metadata that Mai-Lan was chatting about when she was leaving theCUBE, wasn't on camera, wish it was, but talking about all the auto-reasonings coming out, all reasoning engines are coming. Obviously reasoning's big part of the AI wave, but having that metadata is only going to extend better reasoning.

Andy Warfield

>> Absolutely.

Dave Vellante

>> And to explain why that's so important, having the more metadata around everything.

Andy Warfield

>> On the training and inference side, the quality and the selection of data that we're seeing from folks, it's in every conversation. They want to be able to tag the difference between generated versus source data. They want to be able to differentiate and build highly diverse sets of source data and not keep retraining on the same thing or biased selections. And so having a metadata layer that lets you curate but make the curation sticky, right? And being able to then go and select with a query is a really valuable thing for those folks.

Dave Vellante

>> Yeah, yeah. It's funny how S3 is becoming quite the developer go-to. It's almost like, I've used the word bare metal, but I mean it's as low level primitive.

Andy Warfield

>> John, there's a thing in here that I think you guys will get a kick out of because you guys are a little bit OG on the stuff too. The table stuff has this more subtle thing that I think is cool. All of the coverage I've seen this week is analytics, right? Iceberg. Everybody understands that surface. There's loads to talk about in there, but that is one section of the S3 customer base. And what I'm hearing in a lot of conversations is our builders are excited about using tables for application data, for other types of data. And you guys have seen applications built with SQLite or Postgres back there for years, just hidden in the back of whatever application. And I think there's a neat thing that's about to happen where Iceberg actually ends up being that embedded database in some senses, right?

Dave Vellante

>> Yeah.

Andy Warfield

>> It's the backend that applications write to, except all of a sudden now you can bring any analytics tool to bear on your application data. It's .

Dave Vellante

>> It's cynical of all other things. Because it can integrate with other tables, integration layer.

>> Just think about data warehouses in general, and even the cloud data warehouse, it was NBI. It was created because it was just so difficult to... The storage was just dumb storage and now it's not anymore.

Andy Warfield

>> Yeah.

>> It's like it's got all these capabilities and it's dramatically simpler because you've got object and it's highly performant now, and you have choices there. If you want to spend a little bit more, you can drive that.

Andy Warfield

>> Absolutely.

>> And so the possibilities are quite interesting. What about block? Can you give us the update?

Andy Warfield

>> On the block side-

Dave Vellante

>> Is that not your swim lane or...

Andy Warfield

>> I haven't been working as much with the EBS folks this year. Yeah, there's a ton of movement onto io2. The progress there continues to be spectacular and get better. I don't have a soundbite on...

Dave Vellante

>> Yeah, yeah, no, it's all good. When I see what you guys did with the new Aurora distributed SQL, I was like, wow. I don't think there's been a meaningful announcement in distributed SQL in years other than Oracle has some stuff, but actually they don't do distributed. But I think, okay, there's got to be some storage behind it.

>> Well, Jassy mentioned the serverless aspect of Aurora too, with DSQL. You got the DSQL and you got the serverless side of it.

Dave Vellante

>> Yep.

>> He was touting that piece.

Dave Vellante

>> And it's got to be high-performance storage underneath. But anyway, S3 just continues to rocket. We love it. Obviously our data lake. I heard at least now it's S3 table buckets. I'm like, table bucket?

Dave Vellante

>> Yeah. Well, and then whoa. Managed Iceberg tables in S3? Holy cow.

Dave Vellante

>> positive we'd riffed on this a lot.

>> Yeah, this is one of our hottest podcast topics.

Andy Warfield

>> Really?

>> Yeah, because the whole... Dave and the team have been digging in hard on something you can read to, not write. There's all these machinations of things around the different people trying .

Andy Warfield

>> Not really a first-party citizen. Now it is. And all of a sudden, boom. Yeah.

>> Yeah. That's a game changer. This is the first re:Invent I've heard that's had, I won't say reconstruction of AWS, but I'll just say like a reconfiguration of some of the things. New core building block with inference, more discussion of primitives. We've talked more about blast radius on theCUBE here than ever. I don't think we've ever used the term blast radius, was an Amazon term, because you got infrastructure advancements. You got-

Andy Warfield

>> It's an old storage term. It is.

Dave Vellante

>> Well, big old disk drive. Remember?

>> When you're talking to James, "Oh yeah. They blow up too." The consequences of disruption. Again, disruption kills innovation, right? So when you're at scale, if you're going to be storage for the internet...

Andy Warfield

>> Did you guys see the coverage in Dave's talk about some of the Nitro-based-

>> Yeah, yeah, yeah. .

Andy Warfield

>> That's been a project that we've been working on for a few years internally. And it is-

Dave Vellante

>> Let's talk about that a little bit.

Andy Warfield

>> For sure.

Dave Vellante

>> He spent a lot of time on .

>> Is this Neuralink or the ?

Dave Vellante

>> No, no. This is for storage. He gave a master class on hard disk drives and blast radius and separating basically the controller and compute function from the backend JBOD, but give us a little tutorial on that.

Andy Warfield

>> Sure. So I mean, on the hard drive side, the drives get bigger and bigger. The performance stays the same, which means the performance actually gets worse per byte, right? You guys know all -
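
The "worse per byte" point is simple division: if a drive's I/O rate stays roughly flat while capacity grows, the I/O available per terabyte shrinks. A minimal sketch, where the 120 IOPS figure is an assumed, illustrative number and not something stated in the interview:

```python
# Assumed, illustrative figure: a hard drive sustaining roughly the same
# random I/O rate regardless of capacity. Not a number from the interview.
ASSUMED_IOPS_PER_DRIVE = 120


def iops_per_terabyte(capacity_tb: float, iops: float = ASSUMED_IOPS_PER_DRIVE) -> float:
    """I/O operations per second available for each terabyte stored."""
    return iops / capacity_tb


for capacity_tb in (4, 10, 20):
    print(f"{capacity_tb:>2} TB drive: {iops_per_terabyte(capacity_tb):.1f} IOPS per TB")
```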

Dave Vellante

>> Yeah. I mean, how long does it take to recover from these huge drives? Right.

Andy Warfield

>> And for us, for the S3 team in particular, we're always trying to adopt the biggest drive. So we always want to drive cost down. We always want to get the most bytes per, like, square foot of floor space, because the amount of data we store is increasing. We've got to be efficient there. For years, the way that we drove that was two things: adopting bigger drives and packing more drives into a rack. And the way that we would pack more into a rack was to build JBODs with more and more drives behind a host.

Dave Vellante

>> And what was that big barge?

Andy Warfield

>> Barge. Yeah.

Dave Vellante

>> So you literally, they showed a picture of it. It was just a-

Andy Warfield

>> Dave talked about kind of our biggest design, which was 288 hard drives behind one server.

Dave Vellante

>> Oh, boy. Talk about blast radius.

Andy Warfield

>> blast radius.

>> Don't stand too close to it.

Dave Vellante

>> Don't light a match.

Andy Warfield

>> If we did that thing today with 20 terabyte drives on it, it would be six petabytes of storage on a single host. We didn't do that. We backed away from the barge. And at that scale, we were finding we were actually hyper optimized on compute and memory on the server for the number of drives we had. And so we took a decision about four years ago to start exploring another way of doing it and doing this disaggregation thing. And so we stuck Nitro in the hard drive rack and Nitro's job for the hard drives is the same as it is for the compute. Nitro's virtualizing the drives, and it's doing basically nothing else. So, it's just in there to put the drive on the EBS connection on SRD into the instance. And now the S3 team is free to use any compute, any EC2-
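
The "six petabytes" figure follows from the two numbers given in the conversation (288 drives behind one server, 20 TB per drive), assuming decimal units:

```python
DRIVES_PER_HOST = 288    # from the conversation
DRIVE_CAPACITY_TB = 20   # from the conversation
TB_PER_PB = 1000         # decimal units, as drive capacities are usually quoted

host_capacity_pb = DRIVES_PER_HOST * DRIVE_CAPACITY_TB / TB_PER_PB
print(f"{host_capacity_pb:.2f} PB behind one host")
```

That works out to 5.76 PB, which rounds to the six petabytes mentioned, all of it unreachable if that single host fails.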

Dave Vellante

>> Like you scale it independently.

Andy Warfield

>> If you lose the host, you just remap the drives to another host. So the availability is better, right?

Dave Vellante

>> Done, right.

Andy Warfield

>> The flexibility for developers is better. We can actually change instances as the workload on the drives change. So when we first bring a drive online and we're filling it and we're driving a ton of work to it, we can use a beefier instance and then we can scale down as the thing comes into -

>> The elasticity is amazing.

Dave Vellante

>> I mean, I haven't followed it for years, but how are the hard drive manufacturers achieving these densities? They can't stack more platters in there, right?

Andy Warfield

>> They do. They do actually.

Dave Vellante

>> Is that what they're doing? How many platters in a disk drive?

Andy Warfield

>> I think the most recent might be 11.

>> Is it really?

Dave Vellante

>> Oh, my god. Now they're not spinning them faster, right?

Andy Warfield

>> No, the rotational speed's been about the same, but they've moved into energy assist. So the HAMR drives have been coming out. And you guys could go do a deep dive on that. That would fascinate everyone because those drives have a laser that heats up the surface of the drive to make it more receptive to small magnetic changes. The fly height, the head to the surface of the drive, is one nanometer on a modern drive.

Dave Vellante

>> One nanometer, really?

Andy Warfield

>> 10 carbon atoms in that space.

Dave Vellante

>> That's insane.

Andy Warfield

>> It's crazy. And they work, obviously.

Dave Vellante

>> Yeah. Okay. And so that's been able to... That's actually something, one of the few things, John, we got wrong. Because we saw the price of flash coming down and said, "Okay, that's going to kill the high-spin speed disk," which it did, but the price of the hard disk it's just, they've done a remarkable job. Basically two manufacturers left-

Andy Warfield

>> Stayed ahead of it.

Dave Vellante

>> It's like, yeah.

Andy Warfield

>> Absolutely. There's another fact in this that you guys will love, which was that we were looking at the potential in S3, right? The value of S3, when you build a storage system for anything, you're always provisioning for the peak. You're always provisioning for your need. And so when you're building a single tenant enterprise system, you have to anticipate the busiest hour of the day, but then all the other hours have that value of unused stuff. And so the S3 value obviously, is that with loads of tenants, we stack all those bursts up and the system is more utilized. And so we looked at what is the value to the highest bursting customers on S3? And the thing we found as we dug into it, we asked which buckets had data on the most drives, and we found that we have over 10,000 customers, over 10,000 buckets that have their data spread over a million physical hard drives. There's a million disks-

Dave Vellante

>> Wow.

Andy Warfield

>> with the customer's data. It's just, imagine building that system for yourself. It's -
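
The argument about stacking bursts can be made concrete with a toy comparison: separate single-tenant systems must each be sized for their own peak, while a shared system only needs to cover the peak of the combined load. The tenants and demand numbers below are invented purely for illustration.

```python
# Hypothetical hourly demand (arbitrary units) for three tenants whose
# busiest hours fall at different times of day.
tenant_demand = {
    "tenant_a": [10, 2, 2, 2],
    "tenant_b": [2, 10, 2, 2],
    "tenant_c": [2, 2, 2, 10],
}

# Single-tenant systems: each one is provisioned for its own busiest hour.
sum_of_peaks = sum(max(hours) for hours in tenant_demand.values())

# Shared system: provisioned for the busiest hour of the combined load.
combined = [sum(hour) for hour in zip(*tenant_demand.values())]
peak_of_sum = max(combined)

print(sum_of_peaks, peak_of_sum)
```

With these made-up numbers the separate systems need 30 units of capacity and the shared one needs 14, and a bursting tenant can still draw on the whole pool during its busy hour.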

Dave Vellante

>> And so what you have to think about how to recover from failures, right? Because the -

Andy Warfield

>> We're constantly recovering from .

Dave Vellante

>> Because I don't know what the MTBF is on a disk drive these days, but whatever it is, however many drives you have, you're guaranteed to see a failure.

Andy Warfield

>> That's right. There's failures, the drives fail continuously.

Dave Vellante

>> All the time.

Andy Warfield

>> It's the population -

Dave Vellante

>> Even though they're super reliable.

Andy Warfield

>> That's right.

Dave Vellante

>> It's just the math.

Andy Warfield

>> We're just constantly dealing with that, it's part of the system.
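
"It's just the math" can be sketched as an expected-value calculation: even a very reliable drive, multiplied across a large population, produces failures every day. The population size echoes the million drives mentioned earlier; the 1% annual failure rate is an assumption chosen for illustration, not a figure from the interview.

```python
DRIVE_POPULATION = 1_000_000        # "a million physical hard drives"
ASSUMED_ANNUAL_FAILURE_RATE = 0.01  # hypothetical 1% per year; not from the interview

expected_failures_per_day = DRIVE_POPULATION * ASSUMED_ANNUAL_FAILURE_RATE / 365
print(f"about {expected_failures_per_day:.0f} drive failures per day")
```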

>> Andy, one of the things we've been hearing this week, and I want to get your thoughts on this as you guys think about the future, is that Amazon's reached this point of scale where it's kind of rarefied air, in the sense that you see things at scale. You mentioned some of the things: the drives and the failures, the blast radius. What are you seeing as opportunities that the S3 team can do that's not possible by just anyone else just starting from ground zero or time zero? As you start to see these innovations and a lot more things going on, the integration of graphs, what kind of things are emerging to the team in terms of, "Wow, we could do that"? That's something that was on the radar, that's now part of where your view is? You're seeing things...

Andy Warfield

>> So to pick one, on the hardware side that we're talking about, there's a great talk, you guys should watch the recording from earlier in the week. Seth Markle and James Bornholt, who are PEs in S3, gave this talk about S3 gory internals. I was jealous as heck about-

Dave Vellante

>> Total mechanics under the hood.

Andy Warfield

>> Total mechanics. But one of the things they called out is the team has 18 years of experience operating a massive hard drive fleet. And one thing that we've really started to get good at is we basically hedge against that continuous rate of drive failure by having redundancy, spotting the failures and rebuilding way ahead of it. We maintain a huge buffer. It's kind of what's behind the stuff. And in recent years, what we've realized is we can use that exact mechanism to move faster, right? Because if we want to bring in new software or we want to bring in new drives, we can actually pad out that redundancy and take a small population of production hosts, deploy a new thing and run straight into production with it without exposing anyone to heightened risk. And so the scale that we operate at actually means that we can move faster in a storage system that historically you would perceive as being like a battleship of slowness.

Dave Vellante

>> So speed. The one-

Andy Warfield

>> So, our scale is driving velocity on stuff.

Dave Vellante

>> It's astounding, seeing Nitro do just that one function.

Andy Warfield

>> Yep.

Dave Vellante

>> We called it years ago, we called it AWS's secret weapon and it's so true.

Andy Warfield

>> It totally is.

Dave Vellante

>> Amazing.

>> That's awesome.

Dave Vellante

>> All right, hey, thanks for coming on theCUBE-

Andy Warfield

>> It's great .

Dave Vellante

>> and giving us a little inside baseball on this. We really appreciate it, Andy.

>> Thank you. Thanks.

Andy Warfield

>> Thank you.

Dave Vellante

>> John Furrier, Dave Vellante. We'll be right back right after this short break. We're here at re:Invent 2024 in Las Vegas. You're watching theCUBE.