Waleed Atallah, co-founder and chief executive officer of Makora, joins theCUBE Research hosts Gemma Allen and John Furrier to discuss artificial intelligence (AI) factories at NYSE Wired. Atallah brings deep expertise in GPU and TPU kernel optimization, performance engineering and serving open-source models. The conversation examines GPU supply constraints, the role of kernels versus the CUDA moat, hardware-agnostic strategies and Makora's approach to delivering fast, cost-efficient inference across diverse accelerators.
Atallah emphasizes that small kernel and performance gains scale to substantial cost savings, noting that a 2–3% utilization improvement can free the equivalent capacity of thousands of GPUs in large clusters. The discussion highlights Makora's inference platform, which delivers faster and lower-cost tokens with open-source models and enables cost-efficient inference across GPUs and TPUs. Atallah predicts consolidation or strategic partnerships as GPU supply constraints drive inference providers toward acquisitions or alliances. The discussion also addresses data center compute, AI infrastructure, CUDA performance considerations and tokenomics for model serving.
This episode provides actionable insights for data center operators, performance engineers and AI infrastructure teams seeking to maximize inference throughput and cost-efficiency across accelerators.
Waleed Atallah, Makora
>> Palo Alto studio connecting Silicon Valley and Wall Street. I'm John Furrier, host of theCUBE, here with Dave Vellante, my co-host.
Gemma Allen
>> Welcome back to theCUBE Studio. I'm Gemma Allen here live at the New York Stock Exchange. This is AI Factories, one of our programs with NYSE Wired and joining me now is Waleed Atallah, co-founder and CEO of Makora. Welcome, Waleed.
Waleed Atallah
>> Thank you for having me, Gemma.
Gemma Allen
>> So first of all, I'm going to start with something I read because really this made me laugh. You wrote a blog called GPU Go Brrrrr, meaning like a cranking of a physical, something that's kind of running out of steam a little bit, which really defines what you hope to achieve with Makora.
Waleed Atallah
>> That's right.
Gemma Allen
>> Start there. Break that down.
Waleed Atallah
>> Well, GPU Go Brrrrr, as the people who used to have their GPU sitting on their desktop know, once you get it really cranking, the fans start really turning on. You hear the brr. And you know how well you're utilizing your GPU by how loud that fan is.
Gemma Allen
>> So Makora really is all about helping. It's not about there not being enough chips. It's about maximizing the GPUs that you have, right?
Waleed Atallah
>> Yeah.
Gemma Allen
>> That is really the premise of this business.
Waleed Atallah
>> Look, I mean, I think the whole industry is compute starved at this point. From my conversations with other clouds, with other providers, the backlog on getting new GPUs in the door is something like 12 months.
Gemma Allen
>> Wow.
Waleed Atallah
>> Especially for some of the latest chips.
Gemma Allen
>> Okay.
Waleed Atallah
>> So you want to make the most of what you have now and we also want to enable people to make the most of other types of chips beyond just GPUs. So we're talking Google TPU, Amazon Trainium, AMD, all these new chip startups too, Tenstorrent, Cerebras, you name it.
Gemma Allen
>> Yeah. So you're completely TPU agnostic. You're like Switzerland.
Waleed Atallah
>> That's right.
Gemma Allen
>> I love it. So in terms of what it takes though to truly make the most of the physical metal you have, go kind of deep on this from a tech perspective. What exactly is involved in this?
Waleed Atallah
>> Yeah. So the most important piece is a chunk of software called the kernel, the GPU kernel, or the TPU kernel. Essentially it's how you map an algorithm, like a model like ChatGPT or Claude Code or anything, how you map that algorithm to the hardware itself.
Gemma Allen
>> Okay.
Waleed Atallah
>> When I hit enter on ChatGPT and I ask it a question, it runs something like a trillion math operations before it returns back to you saying, "Hi, how are you? I'm good." And the kernel is almost described as the order that you execute those trillion operations. So there's a faster way to do it, there's a slower way to do it, and there's an impossibly slow way to do it. Having these kernels for a variety of different hardware platforms enables you to use those new hardware platforms. This is a big part of what gave NVIDIA, people call it the CUDA moat.
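To make the "order of operations" idea concrete, here is a minimal sketch in plain Python. It is not a real GPU kernel and is not taken from Makora's code; the function names and the tile size are illustrative assumptions. Both functions perform exactly the same multiply-adds for a matrix product, but the second groups them into small tiles, which is the kind of reordering a real kernel uses to keep data in fast on-chip memory.

```python
# Illustrative only: two orderings of the same multiply-add operations.
# These are plain-Python stand-ins, not real GPU or TPU kernels.

def matmul_naive(a, b):
    """For each output cell, walk the full inner dimension in one pass."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


def matmul_tiled(a, b, tile=2):
    """Same operations, grouped into tile x tile blocks (tile size is an assumption)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work inside one block before moving to the next.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            out[i][j] += a[i][p] * b[p][j]
    return out


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
# Both orderings produce the same product; only the execution order differs.
assert matmul_naive(a, b) == matmul_tiled(a, b)
```

In Python the two versions run at roughly the same speed; on an accelerator, the choice of ordering and tiling is where the large performance differences described above would come from.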
Gemma Allen
>> Yeah, for sure.
Waleed Atallah
>> For decades they had been developing their CUDA kernels.
Gemma Allen
>> Yeah.
Waleed Atallah
>> They made them available to everybody so nobody had to worry about that. For new chips, they have just a huge gap in those kernels.
Gemma Allen
>> Now? Okay.
Waleed Atallah
>> But nowadays with AI, closing that gap has been easier and faster than ever.
Gemma Allen
>> So I mean we can certainly get into CUDA and comparing that moat to this current moment, but what are the economics of this? What's the math on this? What is the waste versus the opportunity of running this in a more optimized way, getting the best in that GPU kernel, as you say? What sort of money's being left on the table? Break that down.
Waleed Atallah
>> Oh, so when we talk about performance engineering in terms of CUDA kernel optimization or other types of optimizations, like for example, if you were at like Facebook or Google and you got a 2 or 3% utilization improvement, they would celebrate you up and down the halls because that 2 or 3% on a cluster of 100,000 GPUs, we're talking about thousands of unlocked GPUs essentially, which can result in literally tens of millions of dollars in savings, just from reordering software a little bit. So especially for large scale workloads, AI models generate tokens and the number of tokens you can generate per second directly drives your bottom line, your margin in terms of how much money you can actually make on your AI product or service. So improving that even by small percentages, we aim for bigger percentages obviously, but it's really consequential and that's why people spend so much time and effort on performance engineering.
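The arithmetic behind that claim can be sketched in a few lines of Python. The cluster size and the 2 to 3% range come from the interview; the price of 2 US dollars per GPU-hour is a hypothetical placeholder, not a figure from the conversation, and the improvement is treated as a simple fraction of total cluster capacity.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def freed_gpus(cluster_size, improvement_percent):
    """GPU-equivalents unlocked by a utilization gain, as a share of the cluster."""
    return cluster_size * improvement_percent // 100


def annual_value_usd(gpus, usd_per_gpu_hour):
    """Yearly value of that capacity at an assumed hourly rental price."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR


CLUSTER_SIZE = 100_000        # from the interview
ASSUMED_USD_PER_GPU_HOUR = 2  # hypothetical placeholder price

for percent in (2, 3):
    gpus = freed_gpus(CLUSTER_SIZE, percent)
    value = annual_value_usd(gpus, ASSUMED_USD_PER_GPU_HOUR)
    print(f"{percent}% -> {gpus:,} GPUs, about ${value:,} per year")
```

Under these assumptions, 2% corresponds to 2,000 GPU-equivalents and 3% to 3,000, which lands in the "tens of millions of dollars" range per year; a different hourly price scales the result proportionally.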
Gemma Allen
>> Wow. I saw a video last week on, I was at Dell Tech World, of somebody basically asking an AI model to find a paperclip, like a paperclip and they spent like the best paperclip possible and it spent like $100 on tokens on, I think it might have been Claude or one of those, to get the best results on a paperclip that costs like what, I don't know, a couple of cents.
Waleed Atallah
>> I do.
Gemma Allen
>> So the tokenomics conversation right now is feeling like somewhat insane, right? It's almost like a mania that's happening out there. We hear about token leaderboards. The more you use, the smarter and more determined you are. So I mean, what are your thoughts and like what does it mean for your company? Because there also seems to be this element to this conversation, especially as an enterprise buyer or an atypical buyer in this stack that you kind of don't know what you don't know as well, right?
Waleed Atallah
>> So I think especially when Claude Code and Codex have ... I feel like they have their major unlock moments every now and then. And I do think that the beginning of this year, probably January, February timeframe, a whole new set of applications became reasonably possible with Claude Code. That's when we saw from Uber to Amazon, people talking about token leaderboards, token maxing.
Gemma Allen
>> Yeah.
Waleed Atallah
>> Right? And then people quickly realized that that's not exactly the best metric. I mean, I could spend a million tokens doing paperclip experiments. But ultimately, people want to get to results and how many tokens or how many dollars does it take to get to that result? If you can increase your efficiency of the GPUs, we can really start cutting into those costs. And especially with really powerful open source models now and delivered on platforms like ours, you can get really competitive performance, really competitive intelligence for as much as 10, 20, 30 times less than using these proprietary models.
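The shift from counting tokens to counting dollars per result can be sketched as follows. The token count and the per-million-token price below are hypothetical placeholders chosen only for illustration; the 10x, 20x and 30x factors are the ones quoted in the interview.

```python
def cost_per_result_usd(tokens_used, usd_per_million_tokens):
    """Dollar cost of reaching one result, given tokens spent and token price."""
    return tokens_used * usd_per_million_tokens / 1_000_000


# Hypothetical placeholders: a task that consumes 2 million tokens
# on a model priced at 30 USD per million tokens.
TOKENS_FOR_TASK = 2_000_000
ASSUMED_USD_PER_MILLION = 30

baseline = cost_per_result_usd(TOKENS_FOR_TASK, ASSUMED_USD_PER_MILLION)
print(f"baseline: ${baseline:.2f} per result")

# Factors quoted in the interview for open source models on optimized serving.
for factor in (10, 20, 30):
    print(f"{factor}x cheaper: ${baseline / factor:.2f} per result")
```

The metric being compared is cost per result, so a model that needs more tokens to finish the same task would narrow the gap even at a lower token price.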
Gemma Allen
>> Oh, wow.
Waleed Atallah
>> So there's now just a lot more options, especially when the performance engineering starts to extend beyond just writing kernels and into serving entire models.
Gemma Allen
>> So give me the profile of a typical buyer here. Who are you guys targeting predominantly? Where is the low hanging fruit in the space right now?
Waleed Atallah
>> So we have two products. One of them launched as recently as last week, but before that we had built a product called MakoraGenerate. This was a tool that could help you write these GPU kernels for a variety of hardware platforms, and our customers were chip companies. We've worked with AMD, we've worked with Google, we've worked with Tenstorrent, and we developed kernels for those chips and those were really nice and fruitful. But ultimately we wanted to get higher up the stack. Instead of just writing the kernel, can we serve an entire model? So the product we launched last week is our inference platform where we serve full open source models and you can just pay per token. And for this, we're really targeting people who need the fast tokens. A really good example, Claude Opus 4.6, really powerful, really good model. They also have Claude Opus 4.6 fast. Same exact model, two times the speed, six times the price. There were people willing to pay six times for premium tokens, faster tokens. And so we're building an inference platform that gives people the option for those premium fast tokens across a wider variety of models.
Gemma Allen
>> Wow. So you mentioned CUDA and I know you mentioned AMD, so I presume you use , but you're not really competing there though, right? You're competing at a layer on that, right? It's a layer above truly, right?
Waleed Atallah
>> Yeah.
Gemma Allen
>> And CUDA won, they won in a number of ways, but they really won at the API level. That's where the kind of hearts and minds got taken. So in this new era we're in now where it is about inference and it is about speed and about a completely changing landscape of what the API gatekeeper looks like five years from now, what do you predict? And how do you think about that from a competitive perspective? Who keeps you up at night?
Waleed Atallah
>> Well, so it's a good point. I think if you asked me this question a year ago, I would have given you a completely different answer. But today, people don't care where their tokens come from. People don't care if it's running on like AMD or NVIDIA, or this or that. They just want the tokens to show up and they want the tokens to show up fast. So in that sense, there's a ton of really talented inference provider companies out there that have really incredible performance engineering teams, and it's almost just to the benefit of everybody that there's this competition. We all push each other to deliver faster models at a better price. And ultimately as the performance improves, the margins actually get better on our side too. So we can rent GPUs, turn a profit margin on top of that while giving users a better experience. And when there's other inference providers like Baseten, like Together, like Fireworks really pushing the limits of performance engineering, I think it just ends up being a better situation for the consumer while we all stay in a nice trench fight trying to get the best performance possible out of the GPUs that we all have access to.
Gemma Allen
>> Well, because you mentioned Fireworks and you mentioned a trench fight, there is also this suggestion that some folks will be acquired out of that trench fight. And there are some rumors right now, for example, around Fireworks. What are your thoughts on the kind of acquisition convergence? Do you think that's going to become an even more heavily traveled path over the next year?
Waleed Atallah
>> I think acquisition is going to be increasingly common, especially in this space, because what they realize is your market cap, your revenue is limited entirely by the number of GPUs that you have access to. And so these companies have two ways to get GPUs. Number one, raise the billion dollar round. Get money out there and rent as many of these GPUs as possible. Number two, get in with a strategic, get in with an AWS, with a Facebook, with an xAI, somebody who already has 500,000 of these GPUs racked and stacked and then that becomes your pool of available GPUs. And like I was talking about the supply constrained world in the beginning, nowadays having money is not enough to get the GPUs that you want. So you need some sort of strategic angle, and I think that's why some of these inference providers are going to be looking to find that larger big brother who has all the GPUs already and they're going to use that to turn up the revenue hose.
Gemma Allen
>> Well, when you're sitting at the New York Stock Exchange and you're learning that we can't actually buy our way out of a problem, we know that things are getting really interesting. So Waleed, thank you so much for joining us on theCUBE.
Waleed Atallah
>> Gemma, thank you for having me.
Gemma Allen
>> I'm Gemma Allen coming to you from theCUBE Studio here at the NYSE. This is AI Factories, part of our program at NYSE Wired. Thanks for watching.