Future of Data Platforms Summit | Frederic Van Haren, HighFens

Frederic Van Haren

CTO & Founder

HighFens

Metadata management moves to the center of AI scale challenges

Metadata management has become the practical dividing line between AI systems that scale and those that stall. As organizations push AI from experimentation into sustained production, the limiting factor is no longer models, but visibility into sprawling data estates. Survey results and field experience from companies such as HighFens Inc. show that without usable metadata, operational expansion increases cost and complexity instead of value, according to Frederic Van Haren, chief technology officer and founder of HighFens, a technology consultancy and services firm specializing in high-performance computing, AI infrastructure and big data. “AI by itself is all about scaling

In this episode of the Future of Data Platforms event, Frederic Van Haren, chief technology officer and founder of HighFens, joins theCUBE’s Rob Strechay to dissect the results of a recent survey covering 436 AI and data leaders. The discussion centers on the massive hurdles involved in scaling AI infrastructure, a primary challenge identified by 65% of respondents. Van Haren explains why simply accumulating data is insufficient and details the exponential infrastructure growth – often doubling every 18 months – that successful organizations must anticipate.

Rob Strechay

>> Hello, and welcome back to the Future of Data Platform Summit, this update. Since we got together on the subject of data platforms, a lot has been happening over the past six months. We conducted a survey and we received 436 responses from organizations in the US, UK, Australia, primarily from AI/ML leads, CDOs, CIOs, CTOs, data engineers, and platform engineers. In this episode, I'm joined by Frederic Van Haren, who's the CTO and founder of HighFens to really discuss the outcomes of the survey and how it stacks up with what he is seeing with organizations that are really in the thick of preparing their data platforms for AI. Welcome on board, Frederic.

Frederic Van Haren

>> Well, thanks for having me.

Rob Strechay

>> It's always great to have you. I mean, beyond being on here, we've known each other for a long time, and I think we've been in the thick of this together for a long time, but you're really out there helping customers really build out their AI infrastructure. One of the things that is really interesting is that as part of the study, scaling AI is identified as a primary challenge by 65% of respondents, and this means really scaling their data platforms for AI in particular. The first question I have for you based on the survey results is, what are you seeing in scaling AI from a data platform perspective?

Frederic Van Haren

>> Right. So I mean, AI by itself is all about scaling, right? And so here's how it goes. Somebody comes up with an idea, they have some data, they build a model, and the most valuable data you can collect is actually the data from your customers. And so it keeps on going, right? So you have a model, you get new customer data, you improve. So scaling becomes a real challenge. And as you accumulate all of that data, just accumulating that data doesn't help by itself, you have to process it. So what that really means is that even as your data platform has to grow, you also have to follow up with all the infrastructure. So yes, it is definitely a challenge, and every organization that is successful will have to deal with it. Just from my own past, we used to double everything every 18 months. I mean, you can imagine how challenging this can be, because you can design a system for something and then you don't expect to double so quickly.
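
[Illustration, not part of the conversation: the "double everything every 18 months" remark can be turned into a rough planning formula. The sketch below assumes smooth exponential growth, which real demand may not follow; the function name and the starting capacity are hypothetical.]

```python
def projected_capacity(current_pb: float, months: float,
                       doubling_months: float = 18.0) -> float:
    """Capacity needed after `months`, assuming demand doubles every
    `doubling_months` (18 is the figure quoted in the interview)."""
    return current_pb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Hypothetical starting point: 10 PB today.
    for years in (1.5, 3, 4.5):
        needed = projected_capacity(10.0, years * 12)
        print(f"after {years} years: {needed:.0f} PB")
```

[Under these assumptions, a system sized for 10 PB today would need about 40 PB in three years, which is why a design that looks generous at launch can be outgrown within one hardware refresh cycle.]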

Rob Strechay

>> Yeah. And I would assume that you're still seeing that, that that data, probably even faster, at this point, than even every 18 months.

Frederic Van Haren

>> Right. Definitely. I mean, the amount of data you have access to, I think today there are organizations that have to drop data, simply because they can't afford to not only store it, but also process all of that data.

Rob Strechay

>> Yeah. We'll come back to that, because I think that was another one of the very interesting things that we'll hit on here, is the second challenge identified in the study was that data quality is definitely a primary issue, which was a challenge for 51% of the respondents. How are organizations really addressing the quality issue?

Frederic Van Haren

>> Yeah. I think the problem with data quality is it's not something where you can look at the data and say, "This is quality data for me." The first thing you need to do is, at best, sample the data and process it. So that takes time, that takes a lot of effort. But maybe it's interesting to talk a little bit about quality data. What does that mean? And to me, that's data that positively contributes to the end product. And so what that really means is, let's assume I'm building a model to recognize pictures of cats and dogs, and somebody gives me pictures of elephants and giraffes. So that could be very good data for recognizing giraffes and elephants, but to me, that's not quality data. To somebody else, it could be quality data. The other example I could use is, let's assume I already have 10 petabytes of data and I'm looking for new sets of data. Somebody could give me another 10 petabytes of data, but let's assume that's a remix of the data I already have. So in comparison, you would say it's quality data, but if you give me the same data, I don't consider that as quality data. And so it's very, very difficult. There is really no magic way to figure this out. I think the openness and sharing could be very interesting, where let's assume somebody has like a catalog, like an Amazon.com for data, let's say, right? Where you can kind of guess what the data quality is. I don't think we're there. I'm aware of certain open source communities that keep track of a data catalog, but from a competitive standpoint, if everybody's using the same data, then everybody's kind of building the same product, relatively. So in short, not easy, and how do you measure it?
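
[Illustration, not part of the conversation: the "remix of the data I already have" case can be approximated by measuring how much of a candidate dataset is new. The sketch below is one possible approach using exact content hashes; it only catches byte-identical records, so a re-encoded or resized copy would still count as new. The function names are hypothetical.]

```python
import hashlib
from typing import Iterable


def fingerprint(record: bytes) -> str:
    """Exact-match fingerprint: identical bytes give identical digests."""
    return hashlib.sha256(record).hexdigest()


def novelty_ratio(existing: Iterable[bytes], candidate: Iterable[bytes]) -> float:
    """Fraction of distinct candidate records not already in `existing`.

    0.0 means the candidate set is a pure remix of what is already held;
    1.0 means every distinct record is new.
    """
    seen = {fingerprint(r) for r in existing}
    candidate_prints = {fingerprint(r) for r in candidate}
    if not candidate_prints:
        return 0.0
    return len(candidate_prints - seen) / len(candidate_prints)
```

[A low ratio flags a dataset that adds volume without adding information. It says nothing about relevance, so the cats-and-dogs versus elephants-and-giraffes problem would need a separate, task-specific check.]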

Rob Strechay

>> Yeah. And I know there's a lot of stuff going on with data observability, data quality. There's a lot of startups in this space. We're seeing some that are actually getting acquired by some of the larger data platform companies that are out there. I see this absolutely being one of the largest issues going into this year, is that people are going to be focused on it. In fact, there was a study done that LLMs can't be deterministic, and that they just never will be from that perspective, because of how they actually do the math, and how they actually look at the next best answer, and all of this good stuff that they're really good at. What are you seeing? Are you seeing organizations, because, I mean, you work with a lot of people from the infrastructure all the way to the data science folks. How are they wrapping their hands around this today?

Frederic Van Haren

>> So if we could look into the future, a lot of them are actually looking at data generation as opposed to data collection. So imagine that you can generate quality data based on what you know and what you have and looking around in the market. It's a lot easier and it's a much better investment, but everybody is still doing what they can to get their hands on whatever data they can have. I think it's still the case. But if I had to foresee a little bit, or guess, I think data creation is definitely something people should be looking at.

Rob Strechay

>> And there's definitely some open source projects going on with that as well, which is great. So when I start to look at the next thing that was very interesting from the study, it was that 82% favor selecting best of breed solutions rather than relying on a single vendor from a data platform perspective. There seems to be an ability to kind of deal with what we know is out there, which is data silos. And that was kind of juxtaposed against one of the other questions where we asked about managing these data silos, and 34% of organizations report that data silos continue to pose a significant challenge. That, with another question we asked, which was, "Hey, how do you feel about open table formats and open formats and things like that?" All the rage right now, 87% feel that open formats are important in reducing lock-in. A lot of things there, from best of breed, to data silos, to open table formats. What are your thoughts?

Frederic Van Haren

>> Yeah, I think data silos, we don't seem to be able to get away from it. I think it's a fact of life. Certainly on the training side, where people are being creative, and take data and make multiple copies and modify, and how do you find out what's happening, right? And maybe that's a whole different topic where tools can help, but data silos, I don't see it going away. I think it's definitely a problem. And it's almost like a human problem, not necessarily a machine problem, where it's like a sprawl of data. I mean, the holy grail for me is being able to fingerprint data, right? So where I can find data and say, "This is similar data or it's different data." We're definitely not there yet. But once we're dealing with the data silos and we're trying to solve it, I think the open format is really key. You can get data in .csv formats, or in .zip files, or Parquet or CAR files, you name it. The amount of time people spend moving and going from one format to another, it's huge, right? And it's not a problem that you can underestimate. So a lot of organizations are looking at like an open format, or a format for exchanging data. I mean, one example we can see is on the other side of the fence, which is MCP, right? So what MCP has done is provide you the ability to exchange data through standards, and APIs, and a standard way of exchanging data. We should see a similar thing on the data platform side, right? Let's consolidate on an open format and get it done. We have customers who spend probably 40% of their time just massaging data from format A to format B. It sounds really simple, but if you have petabytes of data and you have to read that data, you have to reprocess it. So we talked about the open platforms, we talked a little bit about the data silos. What about the best of breed? Data platforms, it's such a fast-moving environment. I'm really fascinated by the data life management around it, meaning how do you move data left and right? What do you do with it?
I mean, we hear people saying bring the data to the compute or bring the compute to the data. I mean, in the end, it's roughly the same thing. It's a challenge. And I think that you have to go with, if you want to call it best of breed, but you have to go with the solution that puts all these pieces together. Is it coming from one vendor? I don't know. I feel like they leapfrog each other. So I have a tendency to say, "Ask me today, ask me tomorrow, I might give you a different answer."
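
[Illustration, not part of the conversation: the point that converting formats "sounds really simple" until every byte has to be re-read can be made concrete with back-of-the-envelope arithmetic. The throughput figure below is an assumption chosen for illustration, not a number from the interview, and the estimate ignores write time, CPU cost, and retries.]

```python
def hours_to_read(petabytes: float, gb_per_second: float) -> float:
    """Hours needed just to read a dataset once at a sustained throughput.

    Uses decimal units (1 PB = 1,000,000 GB).
    """
    total_gb = petabytes * 1_000_000
    return total_gb / gb_per_second / 3600


if __name__ == "__main__":
    # Assumed sustained read rate of 10 GB/s across the whole cluster.
    for size_pb in (1, 10):
        hours = hours_to_read(size_pb, gb_per_second=10)
        print(f"{size_pb} PB: {hours:.0f} hours ({hours / 24:.1f} days)")
```

[At that assumed rate, a single pass over 10 PB takes more than eleven days of reading alone, before any conversion work starts, which helps explain why settling on one open format early saves so much effort.]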

Rob Strechay

>> Yeah. I mean, I look at it as they were saying a lot about, "Hey, we want best of breed at the storage layer, at the management layer, at the metadata, GRC layer," things of that nature, to your exact point, versus going with everything from one vendor. And I think it's really hard to get an entire data platform stack from one vendor and be able to use it across your entire estate. Because I mean, you're seeing people with ... I mean, again, it goes back to the whole data silos, where maybe I'm putting different tooling on top of those different data silos.

Frederic Van Haren

>> Yeah. And we see the same thing on the hardware side, right? So on the hardware side, we talk about compute, there's network, and then storage or data platforms. The problem is it's not like in the traditional corporate IT days where you say, "I want the best of breed of storage and the best of breed of network." It's really, how does the system work all together? And the challenge there is you want these components to be interchangeable, which is another problem. So yeah, I mean, I do like it. Every day is a new day, every day is a different answer, but I think that's kind of AI, right? It's evolving, data platforms are changing really quickly.

Rob Strechay

>> I like it. I mean, there seems to be an announcement every week in this space, to put it mildly. One of the things that I know we both have seen throughout our careers, and that was seen in the study, is that 67% of data is stored in either cloud or hybrid environments, with 33% still being on-premises. How is hybrid storage really shaping how and where organizations are building their AI? Because again, the data's not necessarily all in one place, but to your point, maybe Iceberg-format tables help with this from a data platform perspective, longer-term, by being able to bring it together in a more hybrid manner.

Frederic Van Haren

>> Right. I think our organizations look at it more from a practical standpoint, which is, where's the data created? And that's typically where it kind of lives as your primary, and then where do you use the data? And so you could look at the market as, at least from our view, like training and inference, right? So inference, most of the people are using public clouds for inference, because there's the flexibility, the elasticity. Sure, per unit might be more expensive than on-prem, but the reality is that the elasticity makes it a lot easier for their business model. So you can assume that on the inference side, there's a lot more data being stored in the public cloud. What we see on the training side is kind of the opposite. So on the training side, we see a lot of people storing their data locally, or private cloud, if you wish, where they have maybe a secondary copy or where they burst out to the public cloud where there is data. So I'm not surprised that a lot of data ends up in the public cloud, but it's really split between training and inference.

Rob Strechay

>> I think that makes a lot of sense. And I think in the organizations I've been talking to, especially highly distributed organizations, when you think retail, M&E, media and entertainment, where they're doing things in different places, and in healthcare is another one, where you have hospitals that have to be able to do inference locally and things like that, I think that inference comes to the edge. We saw this at CES just earlier this month, that a lot of that and a lot of chips are actually being aimed at that. Even people like Nvidia are aiming at that. But getting the data to that, and where do you store the data, how much of the data do you just update the models? How much of the data is actually stored there versus just processed and stored somewhere else? I think that's still up for debate.

Frederic Van Haren

>> Right. And I think that's where data platforms are going, is where's my data to start with? Where you heavily focus on the metadata. So you need somebody who kind of manages the metadata across the board and then move forward from there.

Rob Strechay

>> Yeah. Yeah. Well, we could talk for hours, I know because we have, but let's kind of take this a level up now. We're into 2026 here. What do you see as some of your predictions going out this year in the realm of data and AI, and what's going to be happening based on what you saw over 2025?

Frederic Van Haren

>> Yeah. I think let's start with the practical problem, which is shortage of storage infrastructure. So that's going to be a pain point for 2026. More from a practical standpoint, I do believe in tools, right? So I talked a little bit about metadata. So when I talk about storage and data platforms, I see two things. One is the metadata, which is what is it, where is it? All kinds of information around it. All the technical metadata. And then you have the actual, the bits and the bytes where you store things. And I think that in 2026, there's going to be a lot more focus on metadata and metadata management. I think that is really key. A lot of the people that have provided storage solutions in the past have ignored this. So to give you an example, a lot of customers have billions of files, and one day they could wake up and say, "Well, that's great, but what has changed since yesterday?" And traditionally they say, "Oh, you have access to all of the metadata," but yeah, it's going to take you 12 hours to walk through the metadata. And by the time you're done, the metadata might no longer be the same, it may have changed. And so you want tools around that. So in short, have your focus on metadata, where's my data? Hopefully also on the formats, where data platforms kind of solidify and stick with an open format. Those two components are really close to my heart. And that's also what I push with my customers, right?
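
[Illustration, not part of the conversation: the "what has changed since yesterday?" question is a diff between two metadata snapshots. The sketch below is a minimal, assumed design using only the Python standard library: it records size and modification time per file and compares two such snapshots. Taking the snapshot still walks the whole tree, which is exactly the slow step described above; a production system would more likely maintain the index incrementally. All names are hypothetical.]

```python
import os
from typing import Dict, List, Tuple

# path -> (size in bytes, modification time in nanoseconds)
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(root: str) -> Snapshot:
    """Walk `root` and record basic technical metadata for every file."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # The file vanished or became unreadable during the walk.
                continue
            snapshot[path] = (info.st_size, info.st_mtime_ns)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Report which paths were added, removed, or modified between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in new.keys() & old.keys() if new[path] != old[path]
        ),
    }
```

[Once yesterday's snapshot is stored, answering the question costs one dictionary comparison instead of a second full walk of the storage system, which is the kind of tooling around metadata being argued for here.]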

Rob Strechay

>> Yeah. No, I think that, again, I made some predictions about a month ago, and I think one of my big predictions was the fact that efficiency was going to be that key. And I think that leads to your metadata, and the technical metadata, and understanding that. I know a lot of the organizations, a lot of the ones that are on this summit are focusing on using AI as things come in and doing that. But to your point, the change rate and other things, of things that are active, and that's the unstructured versus the structured and how they come together. In fact, in the survey, and we'll have some of this data out afterwards, we talked about the fact that people thought they actually had a good handle on the merging of their structured and their unstructured data, which may have been a question issue with leading the witness, but you always say, "Oh, of course I'm great at that." But I think that it was something in the 60% range thought that they had a really good handle on managing their structured and unstructured data. I think this year it's all about efficiencies, because to your point, getting storage is going to be a big issue, doing more with what you have. To your point about how do you age things out, how do you actually use hybrid? Because I think the clouds are going to still have what they have, and not run out of storage per se. So I think you'll see a larger push for hybrid this year than even last year, but I think it'll be also people making that determination, where am I doing my inference? Am I doing it at the edge? Where do I store the data? How much of my customer data do I keep? Do I de-duplicate it? I know we've talked before about the whole EV and the AV, the autonomous vehicle going up the strip in Las Vegas, and it sees the same palm tree how many times, how do you dedupe this data and things like that? I think that will be in greater focus for the corporate folks with their data as well, but-

Frederic Van Haren

>> Yeah, I totally agree. I mean, when I was a customer, I always used to tell my team, if you don't know what data you have, it's the same thing as not having the data at all. And I think that's what data platforms are kind of trying to fix for you. But it's great. I think there's a lot that can be done. I think that the market is maturing really fast. On the other hand, the amount of data kind of keeps on growing. So that's what your first question was all about, scaling, right? The scale will still be one of the biggest challenges.

Rob Strechay

>> Yeah. And funny enough, we didn't talk about it, but we'll hit it here, is that one of the biggest, I think, challenges that organizations, these 436 organizations saw was a skills gap. And I think that's going to be a big thing as well this year.

Frederic Van Haren

>> Yeah, very good point. I mean, skills were always an issue, right? Once upon a time, there were not enough people graduating from colleges and universities with knowledge about AI. And then that kind of changed. So suddenly there was a flood of people going to these courses, and then Google, Meta, and others started kind of hiring all these people. So now you have, again, a shortage. I think what people are having issues with today is when they look for skills, because the technology is going so fast, they don't really find people with that particular skill. So we see a lot of people hiring on potential, meaning they know enough about AI to be dangerous, but not necessarily specific in that area. And then a pitch for HighFens, I mean, the reality is we exist because there's a shortage and we have the expertise, right? I mean, it takes a long time to build up the expertise for organizations like ourselves. So when we go into a customer, there's a lot of pitfalls we can avoid, and help this customer get to a point where they can deliver models a lot faster. I mean, some people still want to hire, but it's a challenge, right? You're kind of chasing your own tail. And what's also frustrating for organizations is once they hire on potential and they bring that individual to a certain level, then the big guys come around and offer them a significant amount of money and then they're gone. So it's almost like organizations look at people as renting them for a while, and then they lose them. And so it's not easy.

Rob Strechay

>> Totally agree. Well, this is great. I really thank you, Frederic. This has been fantastic. I think, again, your insights from being in the trenches, building out these very complicated systems to provide AI and get them to ROI on their AI, are always valued. So thanks for coming on board.

Frederic Van Haren

>> Well, thanks for having me. It was a great session as always.

Rob Strechay

>> Yes. And thank you for joining us on this Future of Data Platform Summit Update. More to come. Stay tuned with some fresh sessions next.