
PEAK:AIO bets on open pNFS to take on Lustre


Software-defined, high-performance all-flash storage company PEAK:AIO is adopting parallel NFS (pNFS) as it evolves from a single-node system into a scale-out product.

Mark Klarzynski

CTO and founder Mark Klarzynski told us: “The big thing really is scaling out. And the bigger thing nowadays is that quest to replace Lustre. I don’t want to criticize Lustre, but it’s getting on some, and pNFS has been [our] focus.”

“We’ve been working on pNFS for 18 months or so. We’ve been fortunate to work with Los Alamos National Labs and, most recently… to work with [Carnegie Mellon University] in terms of exceptionally large scale, to try and make pNFS a realistic HPC replacement, a modern-day file system.”

In standard NFS (versions 3, 4, and 4.1 without the pNFS extension), metadata and data share a single I/O path. Adding parallelism to NFS requires additional metadata, such as a record of which parts of which file live on which data server. With pNFS, metadata and data are handled on separate I/O paths: a metadata server handles all metadata activity from the client, while the data servers provide a direct path for data access. Suppliers that support pNFS, such as NetApp and Hammerspace, have each designed their own metadata scheme and written their own metadata code.
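To make the split concrete, here is a minimal conceptual sketch in Python. The names (MetadataServer, DataServer, Extent) and the striping layout are invented for illustration; this is not PEAK:AIO’s code or the NFS wire protocol, just the shape of a pNFS read: one metadata round trip to learn the layout, then direct reads from the data servers.

```python
# Conceptual sketch of the pNFS metadata/data split. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Extent:
    data_server: str  # which data server holds this piece of the file
    offset: int       # byte offset within the file
    length: int       # number of bytes held by that server

class MetadataServer:
    """Handles all metadata; it only tells clients where file data lives."""
    def __init__(self):
        # A file striped across two data servers in 1 MiB chunks (made-up layout).
        self.layouts = {
            "/data/model.ckpt": [
                Extent("ds1", 0, 1 << 20),
                Extent("ds2", 1 << 20, 1 << 20),
            ]
        }

    def get_layout(self, path: str) -> list[Extent]:
        return self.layouts[path]

class DataServer:
    """Serves raw file data; never consulted for metadata."""
    def __init__(self, blocks: dict[int, bytes]):
        self.blocks = blocks

    def read(self, offset: int, length: int) -> bytes:
        return self.blocks[offset][:length]

def pnfs_read(mds: MetadataServer, data_servers: dict[str, DataServer], path: str) -> bytes:
    layout = mds.get_layout(path)  # 1. one metadata round trip to learn the layout
    # 2. data is then fetched directly (and potentially in parallel) from each
    #    data server, bypassing the metadata server entirely
    return b"".join(
        data_servers[ext.data_server].read(ext.offset, ext.length) for ext in layout
    )

mds = MetadataServer()
servers = {
    "ds1": DataServer({0: b"A" * (1 << 20)}),
    "ds2": DataServer({1 << 20: b"B" * (1 << 20)}),
}
assert len(pnfs_read(mds, servers, "/data/model.ckpt")) == 2 << 20
```

In a real pNFS client the first step corresponds roughly to the LAYOUTGET operation; the parallelism comes from the second step, where reads and writes go straight to the data servers.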

Klarzynski said: “What we’ve made the bold step to do is actually make the pNFS metadata software open source. We announced that at MSST two weeks ago.” That was the IEEE International Massive Storage Systems and Technology (MSST) Conference, which coincided with the 40th anniversary of NFS.

This is the main piece of software that turns NFS into parallel NFS.

He said: “Commercially you could say, why would you do that? But realistically, for a new standard to reach the level that we need it to [reach] in the market, [it] needs more than just us.”

Roger Cummings

CEO Roger Cummings agreed: “We’re hearing it from customers too… and various governments around the world. These [Lustre and NFS] systems are getting so large, they need something that they can replace it with [and] they can’t be single vendor locked-in, in any regard.”

Klarzynski said: “You’ve got to be more open. You need the standard to be adopted, and pNFS Flex Files has an amazing opportunity if we embrace it. We had a great reception from the NFS community; the co-founders of NFS were there, and everybody onwards. So they’re beginning to contribute.”

The pNFS Flex Files idea refers to the Flexible File Layout for pNFS (RFC 8435), in which the data storage devices have only limited interaction with the metadata server. It also supports client-side mirroring for file replication.
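As a rough illustration of what client-side mirroring means, the sketch below (again in Python, with invented names rather than the RFC 8435 wire format) has the client, not the metadata server, fan a write out to every mirror listed in its layout.

```python
# Conceptual sketch of Flex Files-style client-side mirroring. Illustrative only.
from dataclasses import dataclass

@dataclass
class MirrorSet:
    data_servers: list[str]  # every data server that should hold a copy

def client_write(mirrors: MirrorSet, backend: dict[str, dict[int, bytes]],
                 offset: int, payload: bytes) -> None:
    """The client itself writes to each mirror; no server-side replication step."""
    for ds in mirrors.data_servers:
        backend.setdefault(ds, {})[offset] = payload

# A two-way mirrored layout: the same bytes land on both data servers,
# so replication costs the metadata server nothing.
backend: dict[str, dict[int, bytes]] = {}
client_write(MirrorSet(["ds1", "ds2"]), backend, 0, b"checkpoint-shard-0")
assert backend["ds1"][0] == backend["ds2"][0]
```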

Klarzynski said: “Performance-wise, we’re doing some exceptional stuff. With a single 2RU system, you can now do 320 gigs per second, and we scale linearly on every one we’ve tried.”

Cummings added: “You take that building block and you build it on top of each other; you can scale both up and down… When we do go GA with the file system, you’re going to see our software, it’ll have the AI Data Server, it’ll have the software for the file system, and it will immediately recognise additional nodes that come online. It’ll be very easy for customers to scale.”

We asked Klarzynski about CXL and fast object storage access.

B&F: If I had a data server that had a big chunk of DRAM in it, which is accessed via CXL, then the x86 servers inside GPU servers could access the same CXL memory. If I were to load a data server’s CXL with data from its flash, would that provide a faster route for the data to get to the GPUs than NVMe or RDMA?

Klarzynski: “Yes, greatly. Not necessarily always in bandwidth, which is how everybody measures it, but certainly in latency. And that’s certainly, when we talk about GPUs, one of their big challenges, everybody’s jumped on KV Cache. That’s the new thing as we all know. But the reality is, you’ve got this ultra fast memory inside of GPUs, but only so much.”

“And as GPUs now are doing more and more and more, they need more memory and most of them are outsourcing that to local NVMe or fabric-attached NVMe. That’s fine, that works, but it’s not, if you had a thousand GPUs, it’s not that fast. We want to be able to put CXL in front of that.”

“But CXL is not really out there. It’s growing a little bit with emulating NVMe. So we pretend we’re an NVMe driver [but] know it’s a CXL unit. NVMe is fast and low latency. So we give them low latency [by] looking like an NVMe drive… It’s a bit like the old VTL days.” VTL being virtual tape libraries, which used fast disk to emulate slow tape.
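The emulation idea, fronting byte-addressable memory with a familiar block interface, can be sketched conceptually as follows. This is an analogy only, using ordinary anonymous memory as a stand-in for a CXL region; it is not PEAK:AIO’s driver or a real NVMe emulation layer.

```python
# Conceptual analogy: expose a memory region (stand-in for CXL-attached memory)
# through a block-style read/write interface, so software written for NVMe-like
# block devices can use it unchanged. Illustrative only.
import mmap

BLOCK_SIZE = 4096

class MemoryBackedBlockDevice:
    def __init__(self, size_blocks: int):
        # Anonymous memory here; a CXL device would map real shared memory instead.
        self._mem = mmap.mmap(-1, size_blocks * BLOCK_SIZE)

    def read_block(self, lba: int) -> bytes:
        off = lba * BLOCK_SIZE
        return self._mem[off:off + BLOCK_SIZE]

    def write_block(self, lba: int, data: bytes) -> None:
        assert len(data) == BLOCK_SIZE
        off = lba * BLOCK_SIZE
        self._mem[off:off + BLOCK_SIZE] = data

# Callers keep familiar block semantics, but the "I/O" is just a memory copy,
# which is where the latency advantage over an SSD round trip would come from.
dev = MemoryBackedBlockDevice(size_blocks=16)
dev.write_block(3, b"\x01" * BLOCK_SIZE)
assert dev.read_block(3) == b"\x01" * BLOCK_SIZE
```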

B&F: Object people and other suppliers are saying unstructured information is increasingly going to be stored via object protocols. We’ve got S3 as a standard and we’ve got S3 over RDMA being pushed by Nvidia, a kind of GPUDirect for objects. Where does that leave you?

Klarzynski: “We are sort of remaking the way that that protocol is serviced and almost in a parallel nature. So it’s almost like parallel NFS; parallel S3, not quite, but almost. And with the same ethos that you can buy one box and then, if you want another box, you put it on, it will scale out, you stick another box on. You don’t have to invest in a whole load of them at the beginning. So that will probably follow.”

****

This means PEAK:AIO’s Data Server can become a unified block, file, and object protocol system – a very, very fast Ceph alternative.