KubeCon + CloudNativeCon NA 2024 | Akshay Shah, Buf


Akshay Shah

CTO

Buf

Streaming data infrastructure: Scaling AI with cloud-native innovation

In today’s cloud-native era, efficient streaming data infrastructure is pivotal for scaling artificial intelligence training and operational insights. As organizations increasingly rely on streaming data for artificial intelligence training, analytics and operational insights, the challenges of scaling technologies such as Apache Kafka are coming into sharp focus. These include escalating costs, operational complexities and the inefficiencies of legacy architectures in cloud-native environments, according to Akshay Shah, chief technology officer of Buf Technologies Inc.

Introduction to Buf and its mission outside of Google for protocol buffers and schema-driven development

Challenges and complexity of scaling Kafka in the cloud

Benefits of using Bufstream with object storage for cost savings and efficiency in data storage

Role of Buf Schema Registry in enforcing schema evolution and usage guidelines


Established in 2019, Buf extends protocol buffers and schema-driven development outside of Google. Protocol buffers offer efficiency and compatibility for evolving data sets, making them ideal for scaling technologies like Kafka. Bufstream writes data directly to object storage using the Kafka protocol, saving on infrastructure costs. Buf integrates schemas and protocol buffers for data validation and governance at the broker level. The reception for Buf at the CNCF event has been positive within the cloud-native community, recognizing the need for modernizin...


>> Hello and welcome back to KubeCon, CloudNativeCon 2024 North America, Salt Lake City. Here in that Rocky Mountain high. Finally adjusting to the low oxygen levels. Got the power going. We're almost done with day two of three days of coverage here at KubeCon. I'm so excited. This next guest or guests, you'll know one of them. One of them may be new to you. But I will say that the topic that we're covering will be pretty applicable to people who understand Kafka and other things and are really building cloud-native infrastructure. So I want to welcome Akshay on, who's the CTO at Buf. Welcome to the show for your first time on here.

Akshay Shah

>> That's right. Thank you for having me.

>> And then I got Sanjeev, Sanjmo joining me again.

>> Yes.

>> We're going to break stuff down. Love it. Love it. So help people understand what Buf is and why you exist. And give us a little background on Buf.

Akshay Shah

>> Absolutely. Buf was started in 2019, and the mission of the company was to bring protocol buffers and schema-driven development outside of Google and really bring first-class tooling, first-class integrations to make it easy for people to do gRPC, and then to take those schemas and push them down into their data infrastructure in a way that's easy, efficient, and generally take a bunch of lessons that were learned over many years within Google and bring them to the rest of the development community.

>> Where within Google did this really originate? And what problem was it solving?

Akshay Shah

>> I was not at Google, and I certainly was not at Google when protocol buffers were invented, but this was very, very early in the core infrastructure of Google. And at the time the alternative serialization formats you might've looked at were probably like XML or something bespoke and customized. And the problems that Protobuf solves are, number one, efficiency. They're extremely fast over a broad array of use cases. And number two, they have really excellent forward and backward compatibility. So if you're on a big team or you have a data set that's evolving over time, you can put guardrails around that evolution so you don't break your data, you don't break your existing readers and users.

>> If I may add, the reason, in my opinion, why Buf and Protobuf are starting to become more important is because of the rise and the need for streaming data. Because you know how much time we spend talking about training AI models or fine-tuning them or doing RAG. So the use of streaming data is going up but the problem is, a lot of times in technology, correct me if I'm wrong-

Akshay Shah

>> Absolutely.

>> Is that as you scale technology, you have to scale everything around it. So your organization needs to scale. Your cost goes up. So the requirement by businesses is, how can I scale in the streaming space while keeping my cost in check? The cost is going to go up, but it goes up proportionate to business value.

Akshay Shah

>> Absolutely. If you pull on any schema in any organization, whether it's a JSON schema or something for gRPC or Avro, yeah, you'll find a bunch of gRPC stuff, a bunch of Kubernetes services and microservices. But if you keep pulling, all the schemas really are tied to data feeds, and they're tied to data sets in a data lake house or a warehouse somewhere. And that's really what happened to Buf as a company. We started pulling on the schemas and we ended up running straight into Kafka and message queuing and data lake houses. And the big problem with all of these queuing systems and streaming systems is the same thing that the data edge ran into at Uber when I was there. Today, Uber does more than 200 gigabytes per second of Kafka traffic. And this, it is brutal. It's operationally really complicated. Here in the cloud-native community we expect Kubernetes to be our abstraction for compute and then, for the most part, we expect object storage to be our abstraction for disks. If there's going to be cross-region replication, or if I want really fast transfer and I want RDMA instead of kernel networking, I want that handled in MinIO or Google Cloud Storage or S3. I don't want to be in the core Apache Kafka code base in Java trying to eke out small performance wins in the replication layer. So when you try and scale Kafka, what you end up doing is you scale your operational team, because you need a bunch of bespoke knowledge in how to manage partition rebalancing and cluster scaling and all these things, and then you actually just have to pay a whole bunch of money. Because it really wasn't designed to be efficient in the cloud. And so what we have done is we're trying to bring a more modern architecture, a more operationally efficient and cost-efficient approach to this critical concern for every business.

>> How does that help customers really achieve better TCO when they're doing this? Because there has to be a balance between compute and storage and all of that streaming and where you're storing the data and how you're utilizing. I think of it from a retail perspective.

Akshay Shah

>> Okay.

>> You're collecting first-party data across a vast number of devices, be it your laptop, and that is now streaming back into a data warehouse but it's coming from all over the place and stuff like that. And you have to proxy it a little bit. And like you said, you have to be able to collect it, proxy it, and use it. A lot of companies right now are using Kafka to go and actually be that layer underneath there.

Akshay Shah

>> Yeah.

>> How do you see what you're doing in how Buf's approaching it that would be more efficient in that use case?

Akshay Shah

>> Absolutely. I think the reference architecture today if you're using Apache Kafka and the standard modern data stack, what you would do with that workload is you would have all your devices sending data into your backend. All that data would go into Kafka. You'd have some enormous Kafka cluster. And the way Kafka is designed, the compute and the storage come as a unit. So you're provisioning brokers with EBS volumes attached to them, and every time you write to a broker it replicates the data to two other brokers at the application layer for Kafka. And then you run another distributed system, usually Kafka Connect, that's reading from your Kafka cluster and copying the data to your lake house. And when you add all this up what you end up with is at least four, and often six copies of the same data all schlepped between availability zones where every time you bounce out of one zone and into another, AWS charges you $0.02 per gigabyte. And so actually the cost of this whole thing ends up dominated by the replication costs. And so what Bufstream does that's different is it speaks the exact same Kafka protocol. So your endpoints stay exactly the same. Same clients, same protocol, transactions, exactly-once semantics, the whole nine yards. But instead of writing the data to local disk, every Bufstream node writes the data directly to object storage, so to MinIO, to S3, to Google Cloud Storage. And we let object storage handle replication for us. So you get your 11 nines of durability and you get regional or sometimes even cross-region replication for free. It's bundled into the cost of storage. And that lets us serve a workload that might take $100,000, $110,000 on Apache Kafka for only $5,000 of infra cost. It's a huge cost savings. The second thing we do that I think is really a boon to the data engineering and the analytics teams is we take schemas and we take protocol buffers and we put them at the heart of this system.
So instead of pushing data validation and governance and access control out into these thick clients where you as the streaming team are running around with your spreadsheet going to every team. You're like, "Can you please install the library? Please install the thick client. Oh, next quarter please update it." And you just kind of go in a circle around the whole company. We take that and we bring it into the brokers and we say, "Look, if you toggle the switch, now the brokers will check the data before it comes into the cluster and we will make sure that all the data coming into your topics, it matches the schema, it's semantically valid, it's nice and cleaned up, it's already at a level of refinement that you would expect from maybe the bronze layer of your data lake architecture. And we'll just store it as an Iceberg table in MinIO." And now writing to a Kafka topic is the same thing as filling up your bronze layer in the data lake. No extra copies, no extra moving parts, no extra operations.
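The cross-zone replication cost described in this answer can be sketched as simple arithmetic. This is a rough illustration only: the $0.02 per gigabyte figure and the "four to six copies" range come from the interview, while the function name, the 1,000 GB workload, and the simplification that every copy crosses a zone boundary are assumptions made here for the example.

```python
# Illustrative arithmetic only, not a pricing calculator.
# Assumption: each copy of the data crosses an availability-zone boundary once.
CROSS_ZONE_USD_PER_GB = 0.02  # figure quoted in the interview


def cross_zone_transfer_cost(
    gb_written: float,
    cross_zone_copies: int,
    usd_per_gb: float = CROSS_ZONE_USD_PER_GB,
) -> float:
    """Return the transfer cost of copying `gb_written` across zones
    `cross_zone_copies` times at `usd_per_gb` per gigabyte."""
    return gb_written * cross_zone_copies * usd_per_gb


if __name__ == "__main__":
    workload_gb = 1_000  # hypothetical workload size
    for copies in (4, 6):
        cost = cross_zone_transfer_cost(workload_gb, copies)
        print(f"{copies} cross-zone copies of {workload_gb} GB: ${cost:,.2f}")
```

Under these assumptions the transfer cost grows linearly with the number of copies, which is why removing application-level replication changes the bill so much.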

>> That's where I was going to go. And you just said the magic word of, when we go down the Iceberg and all of that and the data lakes.

Akshay Shah

>> Yes. Yeah.

>> And I think to me, are you seeing a lot of uptake with people who are looking? Because this makes a lot of sense. People, even we've had Ali from Databricks on, and he's like, "I don't care about the storage layer. I'll compete at the compute layer and that's where I'm going to compete." Do you see people really embracing Buf because, hey, I can put it into Iceberg and S3 becomes my underlying data lake for that? And then I can go with whatever here. But you guys make it easy to get all that data in there.

Akshay Shah

>> Absolutely. If you go back 10 years and you look at Kubernetes, we were in the middle of this huge battle where there was Kubernetes getting started. We had Mesos. We had Docker Swarm, we had HashiCorp stuff. And it took a while as a community for us to sort out, what is the right pattern here? And can we all get on top of one thing and then get all the ecosystem benefits of centralization, of out of the box integrations? And in a lot of ways I think the modern data stack is where cloud-native compute was 10 years ago. You have this huge profusion of tools. Every company is on this choose their own adventure path, but there are some really common patterns coming out now. And this pattern of Kafka to Parquet and Iceberg to a data lake house is the most common pattern. And I think it's worth it now to say the industry is settling on Iceberg as the lingua franca. If you want something else, you can use XTable. And if you put your data in object storage as Iceberg you can use it in whatever your org uses for their compute layer. You can use it in BigQuery, you can use it in Databricks, you can use it as an external table in Snowflake. You can just use it with DuckDB on your laptop.

>> Right. Yeah.

>> I also want to mention there's another interesting thing that Buf is doing. When Kafka first came out, Kafka was a pure message bus. But with Protobuf the schema becomes a first-class citizen. And to me, schema is really important because I come from the data side. So once you have the schema, then you start doing amazing stuff on top of it like data protection, role-based access control, data quality. So a lot of these things now can be applied to the message as it's being generated. So you don't need to land it in Snowflake and then start assessing the quality in a lake house. You can do it so the whole shift left comes out of the box.
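The idea of checking a message against its schema before it lands in a topic can be sketched as follows. This is a minimal illustration, not Bufstream's implementation or API: the dictionary-based schema, `ORDER_SCHEMA`, `validate_message`, and `accept` are all hypothetical names invented here, and a real system would validate encoded Protobuf messages against registered schemas.

```python
from typing import Any

# Hypothetical schema: field name -> (expected Python type, required?)
ORDER_SCHEMA: dict[str, tuple[type, bool]] = {
    "order_id": (str, True),
    "amount_cents": (int, True),
    "coupon": (str, False),
}


def validate_message(
    message: dict[str, Any], schema: dict[str, tuple[type, bool]]
) -> list[str]:
    """Return a list of problems; an empty list means the message conforms."""
    errors: list[str] = []
    for name, (expected_type, required) in schema.items():
        if name not in message:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(message[name], expected_type):
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    # Reject fields the schema does not know about.
    for name in message:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors


def accept(message: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> bool:
    """Stand-in for the broker-side decision: admit only conforming messages."""
    return not validate_message(message, schema)
```

Running the check at write time means a malformed record is rejected once, at the source, rather than discovered later by every downstream reader.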

>> Right. But how does that approach schema versioning and things like that? And how do you fit into that entire data engineering model like you were talking about?

Akshay Shah

>> Yeah. The other product that Buf builds is a schema registry for protocol buffers. And what we do there is we integrate really early in the dev cycle in your IDE as you're typing. And the Buf tool chain will tell your developers as they're roughing out changes to a schema. Say like, "Stop right there. That is a backwards-breaking change. Anyone who's reading this data, all their code is immediately going to break. Here's how you can do something similar in a way that is forward-compatible." And as you go through CI, we go through your normal change control process and those schemas end up in the registry, which is wired up directly to your message queue. And so we give you all the guardrails you need to make sure that your schema evolution is safe. And it accounts not only for keeping the read-write semantics of the data intact but also for all the governance things you might want. So especially in a world of RAG and AI pipelines, you probably want some control over not just what topics are visible but, for each field of this message, what is it okay to use this data for? If there's a credit card number in here, it's cool for the billing system to read that and actually bill the person. It's really not cool for that to get wired into some LLM.
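The kind of backwards-breaking-change check described here can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the Buf tool chain's rule set: a schema is modeled as a mapping from Protobuf field number to a hypothetical `Field` tuple, and only two checks are shown (a deleted field and a changed field type). Real Protobuf compatibility rules are more nuanced; for example, some type changes are wire-compatible.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one field of a Protobuf message."""
    name: str
    type: str


# A schema here is just: field number -> Field.
Schema = dict[int, Field]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List changes in `new` that could break readers written against `old`."""
    problems: list[str] = []
    for number, old_field in sorted(old.items()):
        new_field = new.get(number)
        if new_field is None:
            problems.append(f"field {number} ({old_field.name}) was deleted")
        elif new_field.type != old_field.type:
            problems.append(
                f"field {number} ({old_field.name}) changed type "
                f"from {old_field.type} to {new_field.type}"
            )
    # Fields that exist only in `new` are additions and are not flagged.
    return problems


if __name__ == "__main__":
    v1 = {1: Field("order_id", "string"), 2: Field("amount_cents", "int64")}
    v2 = {1: Field("order_id", "string"), 2: Field("amount_cents", "string"),
          3: Field("coupon", "string")}
    for problem in breaking_changes(v1, v2):
        print(problem)
```

A check like this can run in an editor or in CI, so an incompatible change is flagged before the schema reaches the registry.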

>> Right. So you have Stream, Bufstream, Buf... what was it? Registry-

Akshay Shah

>> Buf Schema Registry.

>> Are there other parts to this as well, or how-

Akshay Shah

>> There's an open source tool chain that replaces Google's protocol buffers compiler with a more capable Swiss Army knife. It compiles the protobufs, it lints them, it checks for backwards-breaking changes. And it has a pluggable policy engine for companies that want to enforce these kinds of usage guidelines, privacy annotations, and whatever else your privacy or your contracts team wants to enforce.

>> How has the reception been this week here so far?

Akshay Shah

>> It has been pretty tremendous. I think we're the primary backers of a couple of CNCF projects, and that community is growing in a way that we're really happy to see.

>> Which ones?

Akshay Shah

>> It's called Connect. It's an RPC framework.

>> Yep.

Akshay Shah

>> And we've been really excited to see how receptive this community is to a cloud-native, future-facing take on some of this data infrastructure that really has been a little slow to move to Kubernetes and to move to a cloud-native architecture.

>> No, that makes total sense. I think when you look at, and we talk to data people all the time and I've come from that side of the house building stuff. And like I said, I had something that was running Kafka, and the cost was insane in a cloud, so I wholeheartedly agree with that.

>> The other interesting thing about Kafka, and we haven't talked about this, is when we say, "Kafka," we're actually talking about a whole industry. This could be MSK, it could be Redpanda, it could be WarpStream, Confluent. Now there is .

Akshay Shah

>> Yep.

>> So now Walmart has thousands of Kafka clusters, but they're all different.

>> Yeah. Yeah. I think that's also the thing, is how do you standardize and how do you do that? Because the platform engineers who are walking around here, and I've talked to a few of them, a big piece of it is, how do you do it in a way that I don't have to go and rip and replace? So if I can have those endpoints be the same thing or be able to use Kafka as the ingest type part and then pipe it across directly into S3, that's huge to them because they don't have to go rewrite a whole bunch of stuff in there. So that has to be part of the thing, is that they look at you and they go, "Oh, wow. It's not a rip and replace. It's migration, basically, between the two, and I don't have to change as much stuff."

Akshay Shah

>> In the database world we see this with the Postgres query language, where every relational database, they do their best to be a drop-in replacement for Postgres. And it's because people are familiar with it. The library support is there, the ecosystem is there, and you want to meet your customers where they are.

>> Yeah. So last question for you: when we're together in London having a pint, what do you hope you can say in London that you can't say today?

Akshay Shah

>> I hope I'll be able to look you right in the eye and say that Bufstream is the best way to get your terabyte-per-second workload wired up to Databricks, to Snowflake, to BigLake, to the data warehouse in your cloud provider of choice. And we're getting close but we're not quite there yet.

>> It's just awesome. This has been fun.

Akshay Shah

>> Yeah. Thank you.

>> Thank you both for coming on board, and I look forward to that.

Akshay Shah

>> Absolutely. Thank you so much for having me.

>> Yeah. And thank you for watching this episode. We'll be right back, closing out the day here at KubeCon, CloudNativeCon 2024 North America, Salt Lake City. Here we go. See you soon.