theCUBE + NYSE Wired: AI Factories - Data Centers of the Future | Steve Klinger, Lightmatter

Steve Klinger

VP of Product

Lightmatter

In this theCUBE + NYSE Wired segment from the AI Factories – Data Centers of the Future series, Steve Klinger, VP of Product at Lightmatter, joins theCUBE’s John Furrier at the New York Stock Exchange to unpack how silicon photonics is tackling the interconnect bottleneck inside AI-scale data centers. Klinger explains why optics brought directly to the chip are crucial as GPU/XPU performance outpaces traditional electrical links. He details how Lightmatter’s approach increases bandwidth and I/O density – moving beyond “shoreline” limits with 3D integration – ...

>> Welcome back around to theCUBE here in the New York Stock Exchange. I'm John Furrier, your host. Of course, we have our East Coast Palo Alto Studio and here at the NYSE connecting Silicon Valley and Wall Street as part of the NYSE Wired program and community, bringing the two worlds together. And a great topic kicking off this week with our AI Factory series. It's going to be an ongoing series featuring the leaders who are building out these large-scale data centers, the infrastructure needed to power the generation that's coming of software, new software stacks, new chips, technology at power, AI-native applications, and ultimately create a better society and better productivity environment. Steve Klinger is here, Vice President of Product at Lightmatter. They're doing amazing work in one of the hottest areas that we've been covering, photonics, interconnects, the secret sauce to connecting all these big chips that are powering the AI. Steve, thank you for joining us today and thanks for coming in from California.

Steve Klinger

>> Oh, thank you, John. Appreciate the opportunity. Nice to talk with you today.

>> So you heard my little monologue there. AI factories, a term coined by Jensen Huang, kind of abstracts away for the general world all the hard stuff that goes on under the covers to make AI work. AI factory is a very simple concept: it produces tokens, it produces stuff, it provides benefits. Okay, I get that. But there's a lot of work going on, and I'd love to get into it with you. You guys specialize in an area that, like I said, is one of the hottest right now: making high-speed connections between things, devices and chips. Can you talk about what you guys are doing, where you are in the progress of it, and some of the success and momentum you have?

Steve Klinger

>> Yeah, sure. So at Lightmatter, what we're focused on really is unlocking the interconnect bottlenecks in data centers. If you look at what's happened over the last few decades, the performance of the compute has scaled actually quite well, but the interconnect, in other words the connectivity between all the compute chips, has grown at a much slower pace. What that's led to at this point is that the continued performance scaling for AI is really being hindered by that interconnect bottleneck. So what we're doing at Lightmatter is utilizing silicon photonics to provide much higher bandwidth and also enable much higher connectivity in terms of the number of inputs and outputs that can come into and out of each chip, and using light to transmit the data to connect much larger numbers of compute units together in a much tighter fashion to continue the AI performance scaling.

>> So if you zoom out on that, I would like you to explain, because this hits two of the hottest areas right now architecturally: scale-up and scale-out. Normally, the discussion has been about trade-offs between the two. Could you talk about how you guys are intersecting that piece and where you fit in? Is it one side or the other? You mentioned number of, I don't think you said nodes, but systems.

Steve Klinger

>> Yeah.

>> Scale up, obviously, high bandwidth, tightly coupled systems, but now you're talking about systems, connecting systems too. So normally it's like a balance. Okay, I got to do more of this. How do you weigh in on that? What's your position? What's your view?

Steve Klinger

>> We really fit in both places, but if you look at how most AI data center racks are built today, you have copper-based connectivity within the racks. Then once you go beyond the number of compute units that fit inside the rack, you're going to pluggable optics. The trouble is the amount of bandwidth you have there is much lower, so you've created a choke point. What we're doing is bringing the optics directly to all the chips so that you're not constrained in the scale-up domain, if you will, by how many chips you can fit in one rack. You can scale the connectivity across many racks, and so now many racks can act together as that scale-up domain.

>> Yeah.

Steve Klinger

>> And the latencies, and the amount of bandwidth that we can escape from the chip, are also dramatically better than what you can do if you're just connecting electrically.

>> And copper does a job, but only over a short distance. It's not really good beyond that, right?

Steve Klinger

>> Yeah. Relatively short distances, a meter or two, kind of within the rack. And particularly as the speed of the signals gets higher and higher, that distance is really limited. So with optics now, we've completely blown away that limitation. You can now go hundreds of meters across many racks in the data center, as opposed to just having copper-based connectivity that's really constrained to that one rack.

>> Could you scope the bandwidth throughput, the throughput side, and then the energy per bit piece? Because this is where we're seeing a lot of discussion, right? Energy is huge, right?

Steve Klinger

>> You're absolutely right. So you have to go optical, like I said, once you get outside of the rack to connect larger numbers of these racks together into a larger pod. If you look at the power consumption of a pluggable optical transceiver, which is the prevalent way of doing it in the industry today, the energy per bit is in the teens, let's say 15 to 20 picojoules per bit. You contrast that with an integrated silicon photonic approach, where we're landing chips directly on top of photonics, such as with our Passage technology, and you're talking a few picojoules per bit. So it's a dramatic savings in terms of the energy efficiency for that communication.
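[Editor's illustration] To make the energy-per-bit figures above concrete, the short Python sketch below converts picojoules per bit into watts at a given link bandwidth. The 15 to 20 pJ/bit range comes from the interview; the 10 Tb/s bandwidth and the choice of 3 pJ/bit for "a few" are assumptions made purely for illustration, and `link_power_watts` is a hypothetical helper, not a Lightmatter tool or published figure.

```python
def link_power_watts(energy_pj_per_bit: float, bandwidth_tbps: float) -> float:
    """Power (W) = energy per bit (J/bit) * bit rate (bit/s).

    The pico (1e-12) and tera (1e12) prefixes cancel, so
    pJ/bit multiplied by Tb/s gives watts directly.
    """
    return energy_pj_per_bit * bandwidth_tbps


# Assumed bandwidth, for illustration only (not a figure from the interview).
BANDWIDTH_TBPS = 10.0

# 15 and 20 pJ/bit are the range quoted for pluggable transceivers;
# 3 pJ/bit is an assumed stand-in for "a few picojoules per bit".
for label, pj_per_bit in [
    ("pluggable, low end", 15.0),
    ("pluggable, high end", 20.0),
    ("integrated photonics (assumed)", 3.0),
]:
    watts = link_power_watts(pj_per_bit, BANDWIDTH_TBPS)
    print(f"{label}: {watts:.0f} W at {BANDWIDTH_TBPS:.0f} Tb/s")
```

Under these assumptions, the same 10 Tb/s of traffic costs 150 to 200 W through pluggable optics versus roughly 30 W through an integrated approach, which is the kind of gap the "dramatic savings" remark points at.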

>> In our CUBE interviews, we get a lot of linguistics, a lot of jargon. I want to run something by you and get your reaction to it. And I love jargon, by the way; it's great for AI, makes the words explode. But here's a phrase I want you to explain to me. I hear this a lot: "We have dense scale-up racks and we have a need for long-reach scale-out clusters." What does that mean? And how do you fit into that? It's just where it kind of hits the point for photonics. What is a dense scale-up rack? How is that dense? Is it... How crowded is it?

Steve Klinger

>> These racks are incredibly dense. They're packing really as much as they can within the rack. Again, because they're constrained right now by having to connect everything physically very close to each other with the copper-based interconnect. So it is dense in that regard.

>> And long reach scale-out clusters. What's long reach?

Steve Klinger

>> Yeah, long reach means connecting many racks across the data center floor together. So whether that's tens or hundreds of meters. And then when you're going in between data center installations, that can be even much larger distances.

>> Okay. So multi track, multi rack, I mean. Multi rack, I get. What about Metro? I'm hearing more about scale across as a term. It's being kicked around a lot. A lot of hype around scale across, which implies connecting data centers to data centers.

Steve Klinger

>> Yeah.

>> In my mind I'm like, "Whoa, that's a latency issue." What does that mean to you guys? How real is it, and from a usability standpoint, is there a distance limitation? Where does optics fit in? Light is better, right? Speed of light.

Steve Klinger

>> Yeah. Light-based connectivity has been around connecting metros together for a long time, so the optics is there for that type of connectivity. To your point, the larger the physical infrastructure, the more things like latency come into play, because light travels at what? Five nanoseconds per meter, plus or minus. And so there are latency constraints with that distance.
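[Editor's illustration] The "five nanoseconds per meter" rule of thumb above can be turned into a quick propagation-delay estimate. The sketch below is a minimal Python example; the three distances are assumed round numbers chosen to represent a rack, a data center floor, and a metro link, and the result covers propagation delay only (no switching, serialization, or protocol overhead).

```python
NS_PER_METER = 5.0  # approximate figure quoted in the interview


def one_way_latency_ns(distance_m: float) -> float:
    """Propagation delay in nanoseconds over a given distance in meters."""
    return distance_m * NS_PER_METER


def round_trip_latency_us(distance_m: float) -> float:
    """Round-trip propagation delay in microseconds."""
    return 2 * one_way_latency_ns(distance_m) / 1000.0


# Distances below are illustrative assumptions, not figures from the interview.
for label, meters in [
    ("within a rack", 2),
    ("across the data center floor", 100),
    ("between metro sites", 50_000),
]:
    print(
        f"{label} ({meters} m): "
        f"{one_way_latency_ns(meters):,.0f} ns one way, "
        f"{round_trip_latency_us(meters):,.2f} us round trip"
    )
```

The estimate scales linearly with distance, so a hop across the floor costs hundreds of nanoseconds while a metro-scale hop costs hundreds of microseconds, which is why distance becomes a latency constraint at data-center-to-data-center scale.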

>> On system traffic issues. What are the key trade-offs and what are some of the different bottleneck challenges that are out there?

Steve Klinger

>> Yeah, I'd say at Lightmatter with our Passage technology, what we're trying to do is really unlock the connectivity at the chip level. If you look at the latest generation of GPUs or XPUs, the amount of shoreline of the chip that's available to escape data is highly constrained. Typically, you have memory on the east and west sides of the chip, and really the I/O is just coming out of the north and the south. There's a limited amount of shoreline, as people call it, on that chip edge, which means you can only fit a certain number of signals, and those signal rates aren't increasing all that fast. We're starting to hit some fundamental limits there. So what we're trying to do at the chip level using 3D integration is, instead of only having the shoreline of the chip to escape the data and communication, you can actually locate the high-speed signals in the interior of the chip. So now you have area-based scaling for the I/O instead of just shoreline-based scaling.
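[Editor's illustration] The difference between shoreline-based and area-based I/O scaling can be shown with a toy geometric model. The Python sketch below assumes a square die, I/O sites placed at a uniform pitch, and only two usable edges (north and south, as described above). The die size and pitch are hypothetical values picked for easy arithmetic, using the same pitch for edge and interior sites is a simplification, and none of this represents real Lightmatter or GPU design parameters.

```python
def shoreline_io_sites(edge_um: int, pitch_um: int, usable_edges: int = 2) -> int:
    """I/O sites available along the usable chip edges (perimeter scaling)."""
    return usable_edges * (edge_um // pitch_um)


def area_io_sites(edge_um: int, pitch_um: int) -> int:
    """I/O sites available across the whole die face (area scaling)."""
    per_side = edge_um // pitch_um
    return per_side * per_side


# Hypothetical values for illustration only.
PITCH_UM = 500
for edge_um in (20_000, 40_000):  # 20 mm and 40 mm square dies
    edge_sites = shoreline_io_sites(edge_um, PITCH_UM)
    area_sites = area_io_sites(edge_um, PITCH_UM)
    print(
        f"{edge_um // 1000} mm die: {edge_sites} shoreline sites, "
        f"{area_sites} area sites ({area_sites // edge_sites}x)"
    )
```

In this toy model, doubling the die edge doubles the shoreline sites (80 to 160) but quadruples the area sites (1,600 to 6,400), which is the linear-versus-quadratic gap behind the move from edge I/O to 3D-integrated I/O.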

>> Got it.

Steve Klinger

>> And so we're enabling GPU providers, network switch chip providers, and also the hyperscalers who are building those types of chips to leverage that 3D integration to get, instead of a few terabits per second of bandwidth into and out of an XPU, tens of terabits per second.

>> The terabits is a huge point. That's great throughput, and that's key performance. I love that shoreline example. It's like a beach; my mind went to all the houses trying to get beachfront property. You've got some mansions that have a good swath, some don't. That brings up a point around cooling. You're seeing a lot of liquid cooling. How do you get the heat density down? I mean heat down. We're seeing co-packaged optics become part of that. What's the impact of photonics placement, where you do things? Does that impact design, and are you guys hearing a lot of that in that area?

Steve Klinger

>> A lot of the XPUs are liquid cooled. You're talking about a couple of thousand watt chips nowadays. So silicon photonics and co-packaged optics, we're really in that exact market that you're talking about. We have temperature-stabilized photonic circuitry that can live right within that chip package. A lot of the technology that we've developed inside of Lightmatter is aimed at not just implementing the core functionality, but making sure that it's fully stable across temperature variations and across other physical disturbances that might occur in the data center, things like fibers being moved or lightning strikes or things like that. So we've focused a lot on the robustness aspect of it.

>> Yeah.

Steve Klinger

>> But yeah, the photonics are extremely well temperature controlled. And so in this exact environment that you're mentioning, we worked on ensuring that it's a really hardened solution there.

>> Like I said, this is a really critical infrastructure component at the system level. And systems thinking is now mainstream, which I love. We love on theCUBE because we like to talk about systems a lot as well as cloud and other things. Of course, we love hardware and software. You guys are at the convergence of scale up, scale out, which I like. It reminds me of the little hyperconvergence days in the cloud, pre-cloud that ushered in massive growth. Where are you guys on the momentum side, manufacturing volume? Can you share some momentum?

Steve Klinger

>> So where we're at is we're in the process of bringing out, with our lead customers, the initial production variants of our Passage-based solutions. We've done several generations of what I would call test silicon to prove out all of the fundamental capability, do the characterization, and get the technology proven and hardened. We recently demonstrated a 16-wavelength silicon photonic link on a single fiber, which was a really dramatic achievement. We made that public just recently. But we're now in the process of intercepting customer rollouts on their next-generation designs that will be hitting production over the next couple of years.

>> Well, Steve, it's great to have you on the program. The convergence in this infrastructure is enabling all kinds of new growth, even on net new data center build-outs and the on-prem activity with these large-scale systems that are basically acting like supercomputers. They're supercomputers. We're in the supercomputer era right now, so it's really fun. As the VP of Product, you've got the keys to the kingdom, as they say in the business. What's the coolest thing you're working on right now, or a feature or secret sauce that people should know about or that you're excited about? You know what the customers are doing on the roadmap, because you guys have good delivery schedules and good visibility on what's going to happen. What's the coolest thing? What are you excited about?

Steve Klinger

>> Yeah, I think what really excites me is just the scale at which we've been able to implement this technology. We're delivering the equivalent of 40-plus optical transceivers in a little optical engine that plugs onto one chip edge. It's such a dramatic improvement, both in terms of bandwidth and in terms of ports; the term "radix" is what we use. It's such a dramatic expansion in the radix and the bandwidth of what we can escape out of the chip, and we can fit into really any sort of packaging flow. We use standard integration techniques that fit in with how people are building chips and systems today. So the exciting thing to me is just the opportunity to deliver this and, by doing that, unlock the next wave of performance. That ultimately translates into how long it takes to train a model. There are things that we're fundamentally doing that can provide multiples of speedup in the time to train the next model. That fires me up.

>> That's awesome. And just the terabit performance, multi-terabit performance is phenomenal. Taking out hops on the network, these are nerdy things, but this is good. Thanks so much for coming on. I really appreciate it.

Steve Klinger

>> Thank you so much. Really appreciate the opportunity and thanks again.

>> All right. Take care.

Steve Klinger

>> Take care.

>> I'm John Furrier with theCUBE. We are here at the NYSE CUBE Studios with the NYSE Wired community and program, bringing you this ongoing series on AI. Very large-scale systems are redefining the future of the data center and distributed computing. It's the hardware, it's the chips, it's the connections that make it all happen. As this moves fast and accelerates, more value is going to be built on top of it. Again, that's going to open up massive innovation around AI-native applications, data movement, all these things that are going to make things more efficient and more energy efficient. This is the focus, and of course we love bringing that data to you. I'm John Furrier. Thanks for watching.