In this insightful episode, we welcome Tom Traugott, senior vice president of emerging technologies at Edgecore Digital Infrastructure, to discuss emerging trends in digital infrastructure. As part of the NYSE Wired program, Traugott provides an in-depth look into the evolving landscape of data centers, AI demands, and energy constraints reshaping the industry.
Traugott brings a wealth of experience from his tenure at Amazon Web Services and Edgecore Digital Infrastructure. Hosted by John Furrier of SiliconANGLE Media, Inc., this conversation spans topics such as AI's impact on infrastructure and energy demands, as well as Edgecore's strategic pivots after ChatGPT. The discussion emphasizes how Edgecore's data center endeavors meet the needs of an AI-driven world requiring expansive and scalable solutions.
Key takeaways from the discussion include the pressing need for energy-efficient, densely packed data centers capable of meeting exponentially growing AI and cloud computing demands, as underscored by industry leader NVIDIA. Traugott details the shift toward single-tenant buildings tailored for high capacity and performance, emphasizing the necessity of innovative cooling solutions and strategic expansion to support future infrastructure growth.
Sudhir Hasbe, Neo4j
In this Mixture of Experts interview from theCUBE’s NYSE Wired studio, host John Furrier sits down with Neo4j President & Chief Product Officer, Sudhir Hasbe, to break exclusive news: InfiniGraph, Neo4j’s next-generation, fully distributed graph capability. Hasbe details how Neo4j shards highly connected graphs across many machines while preserving core graph behaviors and parallel runtime performance, enabling graphs that span 100+ terabytes and billions of nodes and relationships. He shares real-world scale from customers such as Dun & Bradstreet and Novo Nordisk.
Keep Exploring
What are the details regarding the new product capability being launched by Neo4j, and what challenges did its previous system face?
What challenges were faced regarding data limits on a single machine, and what solutions were developed to overcome them?
What are the core problems that Neo4j aims to solve, and how do these issues impact its growth potential?
What are the capabilities of the Neo4j system in handling both operational and analytical workloads, particularly in the context of graph databases?
What is the significance of the upcoming revolution in business driven by AI?
>> Hello, I'm John Furrier with theCUBE. We are here at our New York Stock Exchange studio. TheCUBE Studio is on the East Coast, of course, part of the NYSE and theCUBE Wired network and series. We've got some exciting news here with Neo4j's Sudhir Hasbe, who is back in theCUBE. He's the President and CPO, Chief Product Officer. Sudhir, thanks for coming in. You've got some big launch exclusive news. Thanks for coming in.
Sudhir Hasbe
>> Thanks, John. Looking forward to the conversation.
>> So everyone who watches theCUBE knows that I love graph databases, because we're in a network effect era, it's social, the infrastructure of the connected world is essentially now distributed computing. You guys have been the leader in graph databases. You run the products. Now President, congratulations. You've got some big news. Share the news. InfiniGraph?
Sudhir Hasbe
>> InfiniGraph.
>> InfiniGraph.
Sudhir Hasbe
>> Like Infinite Graph capability.
>> All right. What is the news?
Sudhir Hasbe
>> Yeah, so we are launching our next generation product capability with InfiniGraph. It's a completely scalable, distributed database that we are launching. So I joined Neo4j around two and a half years back. I came from Google. I ran the big data systems there, all of analytics including BigQuery. And I had this vision that if you could apply graphs on top of that kind of consolidated data for all the organizations, it would be a game changer. But our limiting factor in Neo4j was that your graph had to fit on a machine. We could replicate it and make it highly available, but we were limited by how much data could fit on a single machine. And that bothered me a lot coming from big data systems. I spent a bunch of years, as we were talking about, on Hadoop. Then I spent a bunch of years with BigQuery and worked with Snowflake and Databricks in that environment with customers. And so coming in, I wanted to break that barrier, and that's what we've broken. We've been working on it for two plus years now. Engineering has done a tremendous amount of work to take graphs and shard them across multiple machines without losing the benefits of the core graph capability we have, which is still ACID compliance. We can still do operational and analytical workloads, but you can now have graphs that span across many, many machines. You can do 100 plus terabytes. So now we can be that graph intelligence layer on top of the large-scale lakehouses that are there in the system.
>> So explain the core driver for this. You said two years plus on this. Obviously your pedigree on data is well known. Thanks for explaining that. But the machine was one thing. What was the real driver? What's the core problem that this solves, and what does that mean for the growth of Neo4j?
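To make "preserving the core graph capability" concrete, here is a minimal sketch using the official Neo4j Python driver, assuming, as Hasbe describes, that sharding stays transparent to the client. The URI, credentials, labels, and properties are hypothetical, for illustration only; the point is that the Cypher is the same whether the graph fits on one machine or spans many.

```python
from neo4j import GraphDatabase

# Hypothetical connection details; the driver and query below do not
# change when the underlying graph is sharded across machines.
driver = GraphDatabase.driver("neo4j://localhost:7687",
                              auth=("neo4j", "password"))

# A multi-hop traversal: transactions within three hops of an account.
QUERY = """
MATCH (a:Account {id: $account_id})-[:SENT|RECEIVED*1..3]-(t:Transaction)
RETURN t.id AS id, t.amount AS amount
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(QUERY, account_id="acct-42"):
        print(record["id"], record["amount"])

driver.close()
```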
Sudhir Hasbe
>> Absolutely. There are two sets of use cases I saw. When I came in, I started talking to a bunch of our largest banking and financial services customers, and their thing was that the world is becoming more complex, with a lot more users. The data is becoming bigger and bigger. The fraud systems, and they call them fraud data lakes, are becoming bigger and bigger, and they needed us to break the boundary of the scale we were providing and enable them to have tens to hundreds of terabytes of data in these fraud data lakes. Same thing across financial services. And beyond financial services, in life sciences, drug discovery. Imagine the amount of new strains of viruses coming, and what drugs you want to discover, what chemicals you have used in different tests and all that. So that was growing. So just this need for massive amounts of data was growing. But then what happened was, I joined in 2023, and within the first six months we saw this acceleration of adoption with gen AI, and vectors started becoming more interesting. We implemented vector support in our database in August 2023. Within six months of coming in, we rolled it out, and then we had large customers, including some of the health sciences customers, Novo Nordisk, Pfizer, all of these other customers, and they all were like, "Hey, we want to store tens of millions of documents in the database." So we will put the core entities from the documents into graphs, but all the documents would be stored somewhere else, maybe Elastic, maybe something else, as vectors. And we are like, "That makes no sense. We should have one platform for everything," because otherwise you have to maintain these systems, and how do you connect them? So that was the second set of demand we started seeing. So we are like, "We have to solve this."
>> People were scaling faster on their architecture than what they could meet with their existing thinking.
Sudhir Hasbe
>> Exactly.
>> So this is what some people call the operational-analytical divide, two worlds kind of colliding together. How do you eliminate that divide? How does that roll out? Take us through your mindset of, like, okay, we know that for decades, transactional systems and analytics were run differently. What's this mashup? Or is it a mashup? Was it a reset? Explain to us this divide and how this solves that.
Sudhir Hasbe
>> Yeah, we've always believed in building one system that could do operational and analytical workloads. But as you start scaling, you trade off. You say, "Oh, I will do a row-based system for transactional and I will do a column-based system for aggregates and OLAP," and stuff like that. That's true in the row-column world. In the graph world, you don't have to have that kind of a divide. You can basically say, "I will do high-throughput transactions with the nodes and relationships that you're writing, but I can also do global queries on graphs that can span multiple machines." We do something called parallel runtime. We launched that capability. We can actually parallelize workloads on even a single node or distribute them across multiple nodes, but we keep the graph schema at the core of it. And so this is where you can use Neo4j for fraud detection, where you can identify fraudulent transactions and traverse through everything that transaction has gone through, which person, what all. So you're doing a traversal query. But you can also do, find all transactions like this one. So identify a transaction, then all transactions similar to it, with node embeddings and all. So that is a massive-scale graph analytical query. We can support both.
>> The scope, the scale of the graph database, just to estimate, 100 petabytes? What is the scope?
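A hedged sketch of the two workload shapes Hasbe contrasts, expressed as Cypher: a local traversal for the operational case and a vector-index similarity lookup for the analytical case. The index name, labels, and properties are hypothetical; the db.index.vector.queryNodes procedure exists in recent Neo4j 5.x releases.

```python
# Operational: trace one flagged transaction through its local neighborhood.
TRAVERSAL = """
MATCH path = (t:Transaction {id: $tx_id})-[:FROM|TO|USED_DEVICE*1..4]-(n)
RETURN path
LIMIT 50
"""

# Analytical: the 10 transactions most similar to this one by node
# embedding, a graph-wide query that can fan out across machines.
SIMILARITY = """
MATCH (t:Transaction {id: $tx_id})
CALL db.index.vector.queryNodes('tx_embeddings', 10, t.embedding)
YIELD node, score
RETURN node.id AS id, score
"""
```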
Sudhir Hasbe
>> So I don't think it'll be 100 petabytes. I think these are billions and tens of billions of nodes and tens of billions of... So D&B actually is in billions of nodes, billions of relationships.
>> Dun & Bradstreet.
Sudhir Hasbe
>> Dun & Bradstreet. Yeah, Dun & Bradstreet uses us for that. Novo Nordisk has taken 66 million documents, extracted entities, and created billions and billions of nodes and relationships out of them. So these are in the billions. I would say if lakes are in petabytes, like petabytes to tens of petabytes, you would see graphs more likely in the tens of terabytes to 100, 200 terabytes. I don't think you won't-
>> 100 terabytes plus would be a good range.
Sudhir Hasbe
>> Exactly. Right.
>> So what is the engineering breakthrough? Can you explain for the folks? This news obviously is big. Scale's a big theme. You mentioned your Google pedigree, the SREs, and they scaled up massively. What is the engineering breakthrough here?
Sudhir Hasbe
>> So the biggest problem is taking a graph, which is highly interconnected data, we were talking about graphs where you can understand who is connected to whom, the social networks and all that. How do you take that and shard it and break it into multiple machines? The problem is, if you just do spray and pray and distribute it across multiple machines, you will pay the price in querying. When you query-
>> In terms of latency, accuracy.
Sudhir Hasbe
>> Latency, yeah. Latency, because when you come in and say, "Hey, tell me all the friends of John and their friends and all," suddenly you are hopping across hundreds of machines, which is not effective, it makes it slow. And the other thing is, if you actually want to run massive amounts of query across different machines, how you go ahead and do global queries is the other side. So what we did was we created an index on a single machine that has all the key nodes and relationships without any data. And then all the data gets striped across all the machines that you want to store it on. So we shard everything. So if you have a query that needs to traverse something, we can do it from one single index. And if you wanted to go ahead and do any detailed analysis across all the machines, we can spread it out across them.
>> So you guys can get those little data points. So say, take fraud detection. The social graph stuff's very easy to understand because the social graph, we all have one. But take fraud detection. I do something on my credit card and then something's happening over here. There's a lot of little data points. This allows for fast connections to these data points.
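A toy, non-authoritative illustration of the scheme Hasbe sketches, not Neo4j's implementation: keep the graph's topology (nodes and relationships, no property data) in one compact index, and stripe the bulky property data across shards. Traversals then touch only the local index, while property fetches fan out.

```python
from typing import Any

class ShardedGraph:
    """Toy model: central topology index plus property shards."""

    def __init__(self, num_shards: int):
        self.adjacency: dict[str, set[str]] = {}  # topology only, compact
        self.shards: list[dict[str, dict[str, Any]]] = [
            {} for _ in range(num_shards)
        ]

    def _shard_for(self, node_id: str) -> dict[str, dict[str, Any]]:
        # Consistent within one process; a real system would use a
        # stable hash or a placement table.
        return self.shards[hash(node_id) % len(self.shards)]

    def add_node(self, node_id: str, properties: dict[str, Any]) -> None:
        self.adjacency.setdefault(node_id, set())
        self._shard_for(node_id)[node_id] = properties  # data lives on a shard

    def add_edge(self, a: str, b: str) -> None:
        self.adjacency.setdefault(a, set()).add(b)
        self.adjacency.setdefault(b, set()).add(a)

    def neighbors(self, node_id: str) -> set[str]:
        # Traversal touches only the local topology index: no cross-machine hops.
        return self.adjacency.get(node_id, set())

    def fetch(self, node_id: str) -> dict[str, Any]:
        # Property lookups go to whichever shard owns the node.
        return self._shard_for(node_id).get(node_id, {})
```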
Sudhir Hasbe
>> Yeah. So if you want to say, "Hey, for this transaction, what all has happened? Tell me quickly, I want to traverse through it like this," we can immediately get it from the index and quickly show you.
>> Yeah, John's in Vegas right now. He's not withdrawing money from an ATM in Palo Alto.
Sudhir Hasbe
>> Exactly.
>> That's a quick data point.
Sudhir Hasbe
>> Quick data point. We can do that from the index. But let's say you want to know, "No, no, no. Tell me everything that John has done over a period of time and all that." We can spread that query out to all the nodes, wherever all the detailed data is stored, bring it together, and give it to you. So you get the benefits of scalability, but also performance at scale on the graph query.
>> Well, certainly great news, love the breakthrough. I think, again, this is why we are having that conversation, because we need more scale, low latency. I have to ask you, why does this matter in the generative AI era? Because people watch the TV: "Oh, we're resetting AI." No, there's no resetting of AI. It's still growing. Now a lot of people are kind of disappointed by not seeing massive money making, but there's great progress. We're seeing all the AI infrastructure. That's setting the table for the data layer, which you guys are working on at scale. Why does the graph architecture matter in the generative AI world?
Sudhir Hasbe
>> Yeah, I think the most important thing with generative AI, and especially with agentic and autonomous systems, is accuracy. If you cannot give accurate information back from your data with natural language or whatever, you're not going to use the system. And there has to be an evolution of how you make it more accurate. And graphs are really good at providing the context that large language models need to reason over and give you more accurate information. So that's fundamentally why it is there. I say everybody focuses on data, but it's knowledge that large language models need. So we allow people to convert data to knowledge, as knowledge graphs, which then can be an input to a large language model for reasoning and answering questions. So that's why I think graphs are critical for making AI systems way more accurate.
>> And it's interesting, a lot of the big trends on the enterprise side right now, you're seeing a lot of development on premises. It's not so much repatriation from the cloud, it's more that that's where the data is. Of course it's in the cloud too, but you're starting to see that hybrid distributed computing architecture. So you're going to have the need to have access to that data. As an analyst, I'd ask you, and I'd love to get your perspective on what other analysts are saying about this news, but my ask would be: what are the use cases that you're seeing out of the gate? What has some of the early customer response been? And how would you position this to a prospect or existing customer?
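A minimal sketch of that data-to-knowledge pattern: pull a subgraph as context and hand it to a language model. The Cypher, labels, and the call_llm stub are illustrative assumptions, not a specific Neo4j or vendor API.

```python
from neo4j import GraphDatabase

CONTEXT_QUERY = """
MATCH (c:Customer {id: $customer_id})-[r]-(n)
RETURN type(r) AS relationship, labels(n) AS labels, n.name AS name
LIMIT 50
"""

def answer_with_graph_context(driver, customer_id: str, question: str) -> str:
    # Gather graph facts around the customer to ground the model's answer.
    with driver.session() as session:
        rows = session.run(CONTEXT_QUERY, customer_id=customer_id)
        facts = [f"{r['relationship']} {r['labels']}: {r['name']}" for r in rows]
    prompt = (
        "Answer using only these graph facts:\n"
        + "\n".join(facts)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)

def call_llm(prompt: str) -> str:
    # Stand-in for whatever LLM client is in use; intentionally abstract.
    raise NotImplementedError
```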
Sudhir Hasbe
>> Yeah, let me take three examples on this, three types of patterns I'm seeing. The first pattern we have is these large lakes, data lakes and lakehouses. There's Databricks, Snowflake, BigQuery, Fabric, all of these platforms with tons of data. We consolidate all of this, break data silos, move it into these platforms, and you want to now make sense of it. You want to build agents and agentic applications. As I said, you have to convert that data into knowledge to build next generation applications. We are seeing tons of interest in that pattern. We just hired our global VP of field engineering from Databricks. I'm super excited about him coming over. But this is both of our theories. Now we can enable that more easily and people will be able to do really cool stuff. Whether it's drug discovery-
>> You're attracting some talent with this.
Sudhir Hasbe
>> Yeah, we are attracting a lot of talent.
>> So it's an engineering breakthrough. That's why I wanted to ask that question.
Sudhir Hasbe
>> Exactly.
>> All right, so here's my agent question. A lot of people look at agents and they see the LLMs, large language models, and then obviously, well, the hallucinations. That's all kind of old news. But we've been saying on theCUBE, obviously, that agents are only as good as the data that's fed to them.
Sudhir Hasbe
>> Yeah.
>> It's like a walkie-talkie. What's the data? Latency matters. What is the requirement for agentic to work from a graph standpoint? Is it latency? How would you explain that concept?
Sudhir Hasbe
>> So this is what I was saying. Converting data to knowledge requires you to transform it and get the patterns right, the relationships, but also the context and communities that you can build out within the data sets. So that's one. The second is low latency. If you are doing real-time decision-making with agents, you can't just sit there and wait for minutes or longer. And the last thing is scalability. The data is big, so how do you scale? So those are the three things. Coming back to the use cases you asked me about. So first is that lakehouse pattern. Second, we are seeing tons of use cases where people have data which is structured in one silo and unstructured in another silo, and they are trying to blend these things. So I will give you an example. Manufacturing. The majority of manufacturers, whether it's the biggest aircraft makers, almost every automotive company, they have their bill of materials in a graph in Neo4j. But their parts manuals and detail documents are sitting somewhere on drives. Now we can take that, vectorize it, put it together, and enable really advanced question answering, advanced use cases with agentic and all. So we are seeing tons of that. We are also seeing this other end, which is the legal space, where it's purely unstructured data, converting it into... Do a vector search alone, and you're going to get a lot of hallucinations. But now we can extract entities, create a graph out of the core concepts in the document, and the document gets vectorized in it. We can do it now at billions and billions of vectors. So I think all of that enables another set of use cases. So pure structured and unstructured coming together. And then on the lakes, how can we add more value on top of those data sets that have been created?
>> So basically you make sense of the data mess that's happening out there.
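A hedged sketch of that blended structured-plus-unstructured pattern: store a document's text and vector embedding, and link it to entities extracted from it, so vector search and graph traversal run on one platform. Label, property, and index names here are hypothetical; the vector index DDL follows recent Neo4j 5.x syntax, and the dimension depends on your embedding model.

```python
# Ingest one document: vectorize it and link it to extracted entities.
INGEST = """
MERGE (d:Document {id: $doc_id})
SET d.text = $text, d.embedding = $embedding
WITH d
UNWIND $entities AS entity
MERGE (e:Part {name: entity})
MERGE (d)-[:MENTIONS]->(e)
"""

# Vector index so similarity search runs right next to the graph.
CREATE_INDEX = """
CREATE VECTOR INDEX doc_embeddings IF NOT EXISTS
FOR (d:Document) ON (d.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}
"""
```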
Sudhir Hasbe
>> Yes.
>> There's a lot of mess.
Sudhir Hasbe
>> We create knowledge out of data.
>> It's like dirty. It's all dirty. It's like a house that needs to be cleaned up.
Sudhir Hasbe
>> Yes.
>> So I'm going to ask you, one of the things I've been fascinated with, and I'd love to get your thoughts on this because I studied the history of the computer industry, is search, Web 1.0. Google, which you mentioned you worked at. The big search engine, gen one, was contextual and behavioral. Agents, their behavior is to get a task done.
Sudhir Hasbe
>> Yeah.
>> Okay, context is very, very important.
Sudhir Hasbe
>> Absolutely.
>> This is where I think graphs shine. I want to get your thoughts on this. So how would you explain that concept to a person who's sitting in an environment saying, "I got a data lake over here, I got a little Snowflake here"? I guess, let me rephrase. What does the environment look like for the ideal candidate for Neo4j? Are they sitting on a data lake, or do they want to move to a database? What are the environmental tell signs of "I need this"?
Sudhir Hasbe
>> I would actually say start from the use cases and the business side, right? Just one thing. Yes, the first version of Google search was one. What was the second version? 2012, a knowledge graph. They built a knowledge graph to go ahead and power the search. But taking it forward, if you think about a siloed environment, even if you take all the data from different systems and put it into a lake, you still have thousands and thousands of tables. That's not actually a usable system for a lot of use cases with agents. Agents need to be smarter. And if you have a customer table, an orders table, a supplier table, a product table, and then you have some join tables, how is an LLM going to reason over it? Versus in a graph, we provide context across them. Relationships provide it: a product was bought by this customer. The customer gave this review. This product was returned on this date. It's all in the graph as context for the LLM or the model, the agent. The agent can come and reason over it and say, "Hey, is this customer happy or not? Why have they done something? What next product should we recommend them?" Everything is in the context that is available. And then we can run some smart algorithms, where in graph algorithms we have link prediction. We can go ahead and predict that this customer is more likely to be related to this one because they have a similar address and business.
>> There's more reasoning opportunities-
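A small sketch of that point: the same facts expressed as explicit relationships instead of four tables plus join tables, so an agent's question becomes one local traversal rather than a multi-way join for the LLM to reason about. All names here are illustrative.

```python
# The customer facts as graph context rather than rows to join.
MODEL = """
MERGE (c:Customer {id: 'c1'})
MERGE (p:Product {sku: 'p9'})
MERGE (c)-[:BOUGHT {on: date('2025-01-15')}]->(p)
MERGE (c)-[:REVIEWED {stars: 2}]->(p)
MERGE (c)-[:RETURNED {on: date('2025-02-01')}]->(p)
"""

# One hop gathers everything an LLM needs about this customer.
CONTEXT = """
MATCH (c:Customer {id: 'c1'})-[r]->(p:Product)
RETURN type(r) AS action, properties(r) AS details, p.sku AS product
"""
```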
Sudhir Hasbe
>> Exactly right...
>> ...with graphs.
Sudhir Hasbe
>> Yeah, exactly.
>> So let me ask you a question. So if I'm sitting in a mess, so my environment is, I've got a lot of requests for modernization and transformation. I've got tables and structured data along with unstructured data. I might have a data lake or something going on. Okay, my next question would be, okay, scope out how I get there. Is it like, I call up Neo4j, I connect to a cloud service? Do I have to install stuff? Do I run stuff? What do I do? What are the requirements to get up and running?
Sudhir Hasbe
>> So first of all, Neo4j, we have a managed cloud service that runs on all clouds. Neo4j runs on-prem. Wherever you want to run Neo4j, we can run it. The next thing after that is how you take your data mess, or data ocean, or, as I have been told in the past, data swamp. And I don't have one swamp, I have many swamps. You basically have all the data.
>> Well, it's a mess, but a lot of data.
Sudhir Hasbe
>> Yeah, a lot of data. So how do you take that data environment into knowledge, which is: what are the key entities? What are the key relationships? How do you go ahead and structure it? We can help with that. So one of the things that-
>> In tooling or professional services, or both?
Sudhir Hasbe
>> Both. So we are basically building tooling where you can take data from Databricks, Snowflake, any of the lakes, convert it into a graph data model. This is all AI-powered data modeling. We have trained a model which actually does a good job of taking standard data into a graph model. And within five minutes I can show you a model, I can explain what it does and all.
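A hedged sketch of the load step such tooling would produce once a graph model is chosen: rows from a lake table mapped onto nodes and relationships. The table shape, names, and connection details are made up for illustration; this is not Neo4j's modeling tool itself.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687",
                              auth=("neo4j", "password"))  # placeholder

# Rows as they might arrive from a lakehouse table.
ROWS = [
    {"order_id": "o1", "customer_id": "c1", "sku": "p9"},
    {"order_id": "o2", "customer_id": "c1", "sku": "p3"},
]

# Idempotent load: each row becomes a customer, a product, and an edge.
LOAD = """
UNWIND $rows AS row
MERGE (c:Customer {id: row.customer_id})
MERGE (p:Product {sku: row.sku})
MERGE (c)-[:ORDERED {order_id: row.order_id}]->(p)
"""

with driver.session() as session:
    session.run(LOAD, rows=ROWS)

driver.close()
```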
So we can do that, but we also have our own technical folks. So one of the things that I'm responsible for now is customer success. So we have teams that can come in and help you with this transformation from data to knowledge, and with building agents on top of it.
>> So you have a new person from Databricks on board, you've got the launch. Infinite graphs. InfiniGraph.
Sudhir Hasbe
>> InfiniGraph.
>> What do you want people to remember about this launch? What's the bumper sticker? What's the most important concept? What do you want people to look at this and say, "This is important, because"?
Sudhir Hasbe
>> I think the next big revolution in business is going to be powered by AI. To get value from AI, you need knowledge. We can provide knowledge on top of your data at infinite scale. So technology is not going to be a constraint, and we can help people become successful with InfiniGraph plus all the agent work that we are doing. So I would say, yeah, let's go derive value from AI.
>> Make it happen. Sudhir, great to have you on, and congratulations on being president. Obviously you have the product chops, you're still the CPO, chief product officer. But as president, what's your vision for the growth? Because now, going to market, there's obviously a new opportunity to go next level. Multiple years on this project. What's your plan? What's your focus? What are you optimizing for?
Sudhir Hasbe
>> Just one thing: customer success. So I've been a product person for many, many years now. I've done that in many roles. But the thing I'm very passionate about is how do you make customers successful. And one of the biggest things, I talked to Emil, he's the CEO and founder of Neo4j, is I want to expand my work at Neo4j on making customers successful. So this is where my total focus is now, other than building the best technology: how can we work with customers, make them successful, and help them derive value from what we are building. Especially in the AI world, I think there's a lot of opportunity for me to learn from our customers, but also to help them be successful.
>> Emil's great. Love the history of Neo4j. And again, I think it's just the tip of the iceberg. Growth ahead. We'll be tracking it. Thanks for coming on. I appreciate it.
Sudhir Hasbe
>> Thank you, John. Really appreciate it.
>> Breaking news here in the Mixture of Experts series: having the Neo4j experts explain next generation graph databases is certainly very relevant for real-time, low-latency data feeding agents. Data platforms are super important. As the AI infrastructure continues to get smarter and faster and more scalable, the data platform era is here, and of course agents will flower out of that as a benefit. Of course, theCUBE's got all the data here for you. Thanks for watching.