After several senior Lustre engineers resigned from DDN to form The Lustre Collective, a consulting-style firm focused on the parallel file system, DDN briefed us on how it sees the future of Lustre and its broader storage portfolio. Chief Product Officer Omer Asad outlined the positioning of DDN’s Lustre-based EXAScaler parallel file system alongside Infinia, its object storage software, arguing the two are complementary rather than competing products.
The Lustre Collective (TLC) says: “Our team helped invent Lustre in 2001, and was directly responsible for shipping every major release since 2003. TLC will keep Lustre open, free, and the fastest parallel file system on earth.”
It’s tempting to see the object-based Infinia software as a rival to the file-based EXAScaler (Lustre) product, but Asad insists the two complement each other.
From an organizational point of view, Asad said James Coomer, SVP of Product Management, has been “the one-man show at DDN.” He is “focusing on the EXAScaler product, which is the HPC product, and also helping transform the HPC strategy into what we call our Nvidia integration AI cluster strategy.”
With EXAScaler, “we still remain one of the largest suppliers to Nvidia. So a lot of the GPUs are actually trained and developed on top of the EXAScaler product that we have. And it’s going to retain continuous thrust from the motion… EXAScaler is very quickly becoming the standard for high-speed training and inferencing in NCPs, Nvidia Cloud Providers, as well. So it just keeps on going from that perspective.”
“It is 17x faster than anything else that is available in the market. It is also the basis of the Google Managed Lustre service. So Google Lustre is a first agreement with Google. So EXAScaler is the basis on top of which the Google Console runs.”
Coomer is “looking at how we can take the EXAScaler infrastructure that we have now and bring it more into training and inference, and getting it more and more comfortable there.”
Now that Sanjay Jagad has been appointed DDN’s VP of Product Management, he “is going to run our Infinia product, which is the next generation product that we have.”
Asad explains: “What we have seen is that the AI pipeline has different stages. It’s got data preload, training, inference training, inference relearning, and then data curation. The focus of Infinia is to basically do a data transfer back and forth with Exa, and then DDN becomes the single data layer platform for the entire AI pipeline from a customer’s perspective.”
“When we face the customer, there is no difference between Infinia or EXAScaler. These systems seamlessly transfer data between the two. But the core of EXAScaler is Lustre, and we manage and preserve that. And Lustre is a parallel file system. So to add capabilities like replications, snapshotting and all of those things, it’s nearly impossible to do that in the Lustre system. I mean, a lot of the people have tried. AWS tried, gave up, GCP tried, just basically completely gave up and said, give us EXAScaler. We’ll just use that for our customers. That was a big, big win for us. So what we did was, with Sven Oehme, our CTO, James, myself, what we’ve done is basically we’ve taken a ground-up approach for what the data management and data curation strategy for AI looks like.”
“And that’s what Infinia is. [It’s] a distributed key-value store, which has got high-speed data access services and namespace services built into it. It plugs in behind the EXAScaler product to feed and load data into it. And then it also faces an NFS endpoint, an S3 endpoint, an endpoint with integration into Spark, Hadoop, and all of the data ecosystem, to become the single platform for the customer to manage and curate data.”
“Sanjay focuses on Infinia, James focuses on the overall transition from HPC into AI, and then also what our Nvidia integration ecosystem looks like.”
The picture we see here is that, as far as an HPC and AI customer is concerned, whether they are running HPC, training, or inference workloads, they will see a continuous DDN control plane. It will be consistent across whatever they do, and they will perceive Infinia as their AI training and inference entry point. Behind the scenes, Infinia will use EXAScaler as a high-speed processing engine when it needs to, and there will be a data interchange between the two, which customers can be aware of if they want but need not be, because it happens inside the products.
Asad clarifies this picture: “If the customer says, ‘Hey, I’m training at one to two gigabytes a second per GPU,’ Infinia can do all of those things, no issues. I mean, we’re running at xAI, 350 petabytes in a single Infinia cluster. As far as I know, it is the largest S3 cluster outside of AWS in a single namespace. Now, that’s insane. So it’s built for scale.”
“There are about 20 to 25 percent of those customers that are very much HPC, but… they also want 15 gigabytes a second of throughput in training. There is one thing that does that, and that’s EXAScaler. So then EXAScaler just loads that data off from Infinia, boom, and off it goes and it starts to do that training.”
“It’s very similar to the Google approach, by the way. Inside the Google clusters, the TPUs and the GPUs, when they want to go 1.5 terabits a second, it’s EXAScaler that’s running. And in the backend, it’s loading data from Google Cloud storage for that particular customer.”
EXAScaler plus Infinia is reproducing this outside Google. “In the DDN data plane, you have EXAScaler, it is going rocket speed 17, 15, sometimes 150 gigabytes a second. 20 to 25 percent of those customers do that. xAI is one such customer. Tesla, SpaceX is one such customer. Nvidia themselves is one such customer. But EXAScaler for generations doesn’t have replications, doesn’t have repositories, doesn’t have snapshots. So all those capabilities are built in Infinia.
“But then there are certain customers that say, ‘Hey, we don’t want these two things.’ So what we’re saying is, look, if you want to go 3x faster than VAST, Infinia can do that. Absolutely. So you have a DDN data plane. Inside the data plane, you have EXAScaler and Infinia just hidden. Now, if the customer wants to do fast S3, data versioning, data snapshots, they want to have the same data exported through NFS and Object, all of that magic is happening in the Infinia layer.”
“Now suddenly the customer says, I want to dial this thing up to 150 gigabytes per second because some fancy new LLM needs to be trained in my organization. We’re like, okay, crank it up. We preload the data from Infinia back into EXAScaler and off they go.”
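To make the shape of that Infinia-to-EXAScaler preload concrete, here is a minimal, purely illustrative Python sketch. It assumes a hypothetical S3-compatible Infinia endpoint and a POSIX path standing in for an EXAScaler/Lustre mount; the endpoint URL, bucket name, credentials, and paths are all invented for illustration, and in DDN’s description this staging happens automatically inside the products rather than through customer-written code like this.

```python
# Illustrative only: stage objects from a hypothetical S3-compatible Infinia
# endpoint onto a POSIX path standing in for an EXAScaler/Lustre mount, so a
# training job can read the data at parallel-file-system speed. All names,
# endpoints, and credentials below are invented placeholders.
import os
import boto3

INFINIA_S3_ENDPOINT = "https://infinia.example.internal:9000"  # hypothetical
BUCKET = "training-data"                                        # hypothetical
LUSTRE_MOUNT = "/mnt/exascaler/llm-run-01"                      # hypothetical

s3 = boto3.client(
    "s3",
    endpoint_url=INFINIA_S3_ENDPOINT,
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# Walk the bucket and copy each object onto the parallel file system.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="datasets/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip "directory" placeholder objects
            continue
        dest = os.path.join(LUSTRE_MOUNT, key)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, key, dest)

# The training job then reads from LUSTRE_MOUNT like any POSIX directory.
```

The sketch only shows the direction of the data flow: object data staged from the key-value/object layer onto the parallel file system so the training run can read it at EXAScaler speed.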
A key message coming through here is that EXAScaler remains essential to DDN’s offer going forward; it is not being sidelined by Infinia.
Asad enthusiastically concurs: “No, no, no. We can’t. I mean, we are the only company that covers the entire fricking gamut. And there’s a massive amount of customers still expanding in the HPC space. AMD just put out new GPUs that are specialised for HPC. We just took down a massive oil and gas cluster expansion along with Dell in France. It’s all EXAScaler.”
Here’s what he thinks about the senior DDN engineers leaving DDN to set up The Lustre Collective: “The thing is, Andreas [Dilger] and Peter Jones had been with the company for close to 12 years now. And if Andreas wants to do something and become the Uber god of Lustre, and I’m like, all the power to you, man, absolutely, whatever you need to be successful. At the end of the day, DDN is 100 percent committed to Lustre. We have close to about 70 people that are now in the Lustre team dedicated and focusing on that. And this is basically the expansion that we have built around our organization.”
“At the end of the day, Andreas still remains one of the closest advisors that DDN has in advancing the Lustre strategy and also in advancing the EXAScaler strategy. Peter Jones still remains one of the primary contributors in an advisory capacity to us as we jump in and handle HPC deals. But if two senior people want to expand their horizons and want to dabble in personally doing business themselves, setting up their own company for the first time, because they fancy that, we’re definitely going to encourage it. We’re not going to stop it.”
EXAScaler and Infinia are cooperating, integrated partner products, each with its own role, under a single DDN control plane spanning HPC, AI training, inference, and data curation.