We just sent you a verification email. Please verify your account to gain access to
Snowflake Data Cloud Summit 2024. If you don’t think you received an email, check your
spam folder.
To sign in, enter the email address you used to register for the event. Once completed, you will receive an email with a verification link. Open this link to automatically sign in to the site.
Register For Snowflake Data Cloud Summit 2024
Please fill out the information below. You will receive an email with a verification link confirming your registration. Click the link to automatically sign in to the site.
You’re almost there!
We just sent you a verification email. Please click the verification button in the email. Once your email address is verified, you will have full access to all event content for Snowflake Data Cloud Summit 2024.
I want my badge and interests to be visible to all attendees.
Checking this box will display your presence on the attendees list and allow other attendees to view your profile and contact you via 1-1 chat. Read the Privacy Policy. You can disable this preference at any time.
Select your Interests!
add
Upload your photo
Uploading...
OR
Connect via Twitter
Connect via LinkedIn
EDIT PASSWORD
Share
Forgot Password
Almost there!
We just sent you a verification email. Please verify your account to gain access to
Snowflake Data Cloud Summit 2024. If you don’t think you received an email, check your
spam folder.
To sign in, enter the email address you used to register for the event. Once completed, you will receive an email with a verification link. Open this link to automatically sign in to the site.
Sign in to gain access to Snowflake Data Cloud Summit 2024
Please sign in with LinkedIn to continue to Snowflake Data Cloud Summit 2024. Signing in with LinkedIn ensures a professional environment.
>> Welcome back to Moscone South, everybody. My name is Dave Vellante, and it's rocking here at Snowflake Summit. I'm here with George Gilbert. Rebecca Knight is also in the house, and we're really excited to have Ron Ortloff here as the Head of Iceberg and Data Lake at Snowflake. We're going to geek out big time. George, Ron, welcome to theCUBE. Thanks for coming on.>> Hey, George. Thanks for having me.>> Iceberg's all the talk. We had the great Iceberg debate last night, which is going to be ongoing. We love the action, the competitiveness, but let's start with Iceberg. What do people need to know about Iceberg Open Table formats? Why now? Why is it so important?>> Yeah. The big deal at Summit here: yesterday we announced that Iceberg tables are GA, so generally available. We had a lot of pent-up demand, customers trying it out in preview, and we're just really happy and super excited that it's now generally available for customers to adopt and deploy to production.>> Right. Okay, but in 2022, I remember Benoit asked the audience, "Who here has heard of Iceberg?" Five hands went up in the audience. I had heard of it, but I was like, "Tell me more.">> A little bit different now.>> Right, now everybody knows. Explain what Iceberg tables are and why they're all of a sudden so important.>> Yeah. Really, at the core of Apache Iceberg is compute engine interoperability. So we've got a growing community of vendors that are contributing to the project. We have now a wide variety of compute engines that support Apache Iceberg, and this is where customers now have more choice. You have some shops that use Spark, some shops that use Snowflake. Now they can have a single copy of data in an open table format and be able to interoperate on top of that data.>> However.>> Right.>> Okay. So help us distinguish between, there's still two types of Iceberg tables, the managed Iceberg, which is, it sounds like you get all the functionality you get from native Snowflake tables and then unmanaged.
But while you're telling us about those two types, help us also understand what customers are first trying to do, the workloads where they want Iceberg formatted data.>> Maybe I'll answer the last question first.>> Yeah, definitely start there. That makes sense.>> Think about how customers are architecting their systems. So you have some customers that have scenarios where they want to leverage Snowflake for the greatness of data sharing or AIML or the advanced RBAC features that we have. They don't want to have to copy that data into Snowflake. They've got petabytes and petabytes of data in their data lakes. They want to be able to have Snowflake reference that data in place, zero ingestion. So that's where having your data in Parquet, in an Apache Iceberg catalog, we can connect to that data directly and give you that great Snowflake experience. So that's sometimes been historically referred to as unmanaged, but that's more of the scenario that people are going for where they don't want to ingest that data, they want to use it in place and have a great Snowflake experience on top of it.>> When you say that great Snowflake experience, you mean you're going to be able to read that data into Snowflake->> Yeah.>> Right?>> Absolutely. Yeah.>> Then it will be->> Have everything I get from Snowflake->> Yeah.... >> RBAC features, AIML, all that stuff is there, data sharing. Those are some of the big things that are exciting our customers.>> It's read-only access, correct?>> It is read-only access. We refer to that as an externally-managed scenario where you have an engine that is producing your Parquet, producing your Iceberg metadata. Some other system is taking care of table maintenance of that data. Again, Snowflake's giving you that great platform experience on top of that offering.>> Okay. What if I want to be able to have read-write access to it?>> Yeah, so we do have an option where you can leverage Snowflake as a truly managed Iceberg table experience.
So here at Snowflake you can leverage end-to-end platform features and ingest data into the Snowflake engine. We will produce Parquet, very efficient Parquet. We produce Iceberg metadata. We have an Iceberg catalog now. You can connect to that from engines like Spark and consume that data directly from your cloud storage account. It is truly a managed end-to-end experience. We take care of table maintenance for you, things like compaction, where if you're doing it on your own, you're spinning up your own compute and maintaining your own jobs. Here with a managed experience, Snowflake takes care of all that for you.>> How does that work? This may be such a basic question, but are you bringing that data into Snowflake or are you wrapping Snowflake around the data?>> Whenever you talk about Snowflake and Iceberg, data is always outside, it's stored externally.>> All right. Okay, great.>> Right?>> So the latter, really.>> More the latter.>> You're bringing that Snowflake experience to the Iceberg.>> To that data. But if you're like 100% all in on managed Iceberg tables with Snowflake, you're still only getting a compute bill from Snowflake and a storage bill from your cloud provider. Right? There is no storage inside of Snowflake. You will never be billed for storage from Snowflake.>> Got it.>> But just to clarify, when we have, I guess the term now is externally-managed Iceberg tables where you have a big data estate, what used to be maybe Hadoop, what Snowflake functionality is unavailable on that externally-managed stuff, and leave out... we understand no writes, but do you not get the attribute-based access control? What is it that you don't get from Snowflake?>> Yeah, in this sort of model, again, think of there's more producers of the data, the data's being curated, whether it's in a medallion type of architecture, perhaps there's a bronze and silver zone that are curating that data in place, and Snowflake is reading that and accessing that data.
We don't assume that we have write permissions to that data. That's how some people want to configure their system. So something like Snowflake clustering where we physically reorder the data and write it, that would be an example of a capability that's not available for an externally-managed table.>> Okay. So can you apply attribute-based access control like->> Yes. Yeah, a ton.... >> dynamic masking and all that sort of stuff on->> RBAC on that externally-managed table.>> External. Okay.>> Yeah, that's a fantastic use case, where I was talking about data sharing or buzzwords, AIML stuff, but basic RBAC stuff works very well, and it is very popular with our customers.>> So->> So if a customer chooses, sorry, George, if a customer chooses to use Iceberg and that, but wants that RBAC capability, all the other wonderful governance, they use managed Iceberg tables, great, check it and->> That doesn't really matter. You'll get it in any Iceberg experience with Snowflake.>> Right, okay.>> Including the externally-managed.>> Yeah, no, I understand that. But then on the other end, what's the trade-off for the customer on the other end? What can't they do that they could have done if they weren't going into managed tables?>> Yeah. The thing like we talked about just now with clustering where we physically reorder and optimize the data, you're going to need to leverage compute outside of Snowflake.>> Okay, I see. Okay. So this is->> For maintenance. Maintenance. It's maintenance.>> For maintenance. Okay, great. So this is the big Iceberg debate. I say, why not just put it in Snowflake? Of course, I love that answer.>> Well, I think maybe another way to think about it, at the core, Iceberg is a metadata solution as well as Parquet, right? That's the prevailing file format->> Sure.... >> even within Iceberg. Most operational systems do not produce Parquet. It's an analytical format, it's columnar based, it's high compression.
Something usually has to take data from its original format into Parquet. So part of that managed table experience, we can do that conversion into Parquet, give you the Iceberg metadata, which you were going to have to do somewhere anyway, right? So use it as part of now the managed table experience, bring that data, ingest into Snowflake. We'll write great Parquet for you, put it on your cloud storage account, and then Iceberg kicks in from there.>> How difficult is it, if at all, to take an externally-managed set of Iceberg tables and make them fully managed?>> It's a simple command.>> So there's no import/export or anything like that.>> So we do have a lot of customers that are in a situation where they're using an external catalog, the DIY approach, do-it-yourself, where they've spun up their own maintenance jobs, but they're getting tired of that. They're growing too big, it's getting too complex. You can take those externally-managed Iceberg tables in Snowflake and run a table conversion command, which then transfers control to the Snowflake catalog. They become a fully managed table in place. We don't touch the existing Parquet data that's there or the metadata snapshots that are there. We leave it all, and then we start treating that as a managed table experience. So this is very popular for customers that don't want to go through a big upfront ingestion of data into Snowflake. They have an existing estate, but they want to simplify and consolidate into a managed table experience.>> So the story here is Snowflake's responding to the pressure from the marketplace and the customer's saying, "Look, just exactly that. We might not want to->> Friendly reminder that the partner keynote is now open. Please make your way there.>> "We might not want to put all the data into Snowflake, at least for now. So we're going to leave it in our Iceberg tables." 
You guys are responding to that saying, "Okay, fine, we'll support that, and then we'll see what happens over time." George, I have a feeling that a lot of times customers will be like, "Yeah, but I really want..." I said it, customers want to have their cake, they want to eat it, and they want to gain weight. What do you say? They want to solve cancer too.>> Yeah, yeah.>> But you can't have it all.>> Tastes like chocolate and it cures cancer.>> Do you want it to be open and limit the "risk"? I put risk in quotes 'cause I don't think it's that risky of lock-in, or do you want the integrated experience, and do you want all the benefits that come around with it? I personally think the market will always choose the latter unless open-source can do what the integrated solution does.>> Well, let me ask then a different->> We're having the debate in front of Ron, I apologize. That's really rude of me.... >> a different scenario. So let's say you've got a lot of managed tables in Iceberg that are essentially inside Snowflake, but a customer says, "I want to use Spark to do some transformations," or, "My data scientists just want to work in Databricks"->> Sure.... >> something like that.>> Yeah.>> So they can read all that data->> Yep.... >> they just can't update it.>> Right.>> In a medallion architecture, there might be bronze and silver in the managed Iceberg tables, and then they might create data products, gold products in Spark. If they can read and write Delta and Iceberg together, that might be outside then.>> Yeah.>> In fact, that might be an externally-managed Iceberg table from your point of view.>> Sure. Yeah. Yeah. Yeah, but I think this is also where Polaris can help now start filling in some of those gaps, right?>> Okay. Elaborate.>> Yeah. So with Polaris now you can have multiple catalogs that are federated under a single umbrella, a single pane of glass where you have RBAC that's shared across these different catalogs.
So like you're saying, I could have Spark that's owning a couple of different tables. In Snowflake, like you said, I have my bronze and silver layers, but everything will be under a single pane of glass, a single layer of governance, and you'll be able to query that from Spark, write to that catalog from Spark, also produce, like you said, those bronze and silver layers in Snowflake and consume those.>> What does that mean to Iceberg, specifically, Tabular now being part of Databricks in theory?>> At the end of the day, Apache Iceberg is an open-source project. The great thing about the Apache Software Foundation is it has a very transparent PMC and committer process. Things are very, very transparent and visible. So there is the power of that sort of open-source and the long track record of the Apache Software Foundation, doing things like Apache Parquet, Apache Spark, these things are fundamental capabilities in our industry now. Iceberg is just another one of those things under that same umbrella.>> And no one vendor can own it.>> No one vendor can own it.>> If they try to, then it just becomes closed source.>> But just to be clear, the scenario could be good and favorable for Snowflake where if it becomes transparent, whether you're reading and writing Iceberg or Delta, and then there's an open-source catalog that you can synchronize with, others can synchronize with, then we really eliminate data formats as an issue. Then we are worried more at the higher layers, which is the metadata management and the capabilities of the compute engines.>> That's right. That's right.>> Then you're competing on that level.>> Yeah. You're elevating it to another level and that's, again, where that compute engine interoperability comes into play.>> If you have the best compute engine that's in your favor, is your point.>> That's where we want to compete. We feel we have a fantastic data platform. Tons of capabilities, right?
Fantastic query engine, long track record of tremendous performance that satisfies our customers. That's where we want to compete.>> All right, Ron, you got six minutes to get to your next appointment at North. So thank you very much.>> Awesome. Yeah, this was fun.>> Really great having you on.>> Thanks, Ron.>> All right. Thank you, George. Don't get up yet.>> All right.>> All right. Thank you. Keep it right there. We'll be back with our next guest. This is Dave Vellante for George Gilbert and Rebecca Knight. We're live from Snowflake Summit 2024 in Moscone South. Stop by and see us.
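The managed-versus-external distinction discussed in the interview — externally-managed Iceberg tables are read-only to the engine, and a single conversion command transfers catalog control in place without touching the existing Parquet files or metadata snapshots — can be sketched as a toy model. All class, method, and catalog names below are illustrative assumptions for the sake of the example, not Snowflake's actual API:

```python
# Toy model of the externally-managed vs engine-managed Iceberg table
# distinction described above. Names are illustrative, not a real API.

class IcebergTable:
    def __init__(self, name, data_files, catalog="external"):
        self.name = name
        self.data_files = list(data_files)  # existing Parquet files in cloud storage
        self.catalog = catalog              # which catalog owns (manages) the table

    @property
    def managed(self):
        return self.catalog == "snowflake"

    def read(self):
        # Reads work in either mode: the engine consumes Parquet in place,
        # with zero ingestion.
        return self.data_files

    def write(self, new_file):
        # Writes (and maintenance like clustering or compaction) require the
        # table to be managed by the engine's own catalog.
        if not self.managed:
            raise PermissionError("externally-managed table is read-only")
        self.data_files.append(new_file)

    def convert_to_managed(self):
        # The in-place conversion: catalog control transfers, but the existing
        # Parquet data files and metadata snapshots are left untouched.
        before = list(self.data_files)
        self.catalog = "snowflake"
        assert self.data_files == before  # nothing was rewritten
        return self
```

The property mirrored here is that the conversion flips only who owns the table; the data files stay where they are, which is why it avoids a big upfront re-ingestion of an existing estate.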