Arcitecta CEO and founder Jason Lohrey presented his thoughts on the company’s progress and future at an IT Press Tour event in New York this month.
Australia-based Arcitecta’s Mediaflux distributed data management software supports file and object storage in a single namespace, with tiering across on-prem, public cloud, and hybrid environments and SSD, disk, and tape storage tiers. The platform includes the Livewire data mover and a metadata database. Mediaflux Multi-Site, Edge, and Burst offerings help geo-distributed workers get fast access to shared data – text, images, video, time-series, etc. – while Mediaflux Real-Time offers virtually instant access to content data. Arcitecta competes with Datadobi, Hammerspace, and Komprise. Recent customer wins include Princeton University, Dana-Farber Cancer Institute, the National Film and Sound Archive of Australia, Technische Universität Dresden, and the UK’s Imperial War Museum.
As organizations store more and more data, heading toward hundreds of billions or even more than a trillion files, Lohrey said: “Pretty much everyone needs data management, and that’s good for us. That’s the space we’re in, including those that are doing geo-distributed.” This needs more than a distributed file system as it’s “a flow of data with a single pane of glass to see where all of these things are at any point in time and where they’ve been, and controlled to create these super holdings of data and orchestration of flow.”
“That’s not your normal file system kind of thing, but it will actually involve file systems and general distributed data.”
“During the last year, we figured out how to double, or better than double, the information density of the database. This, to me, is the Holy Grail; how to increase the density of information in the database with a given footprint, and it is the thing that we’ve spent years and years and years and years of R&D on.”
“And this year, I think we’ve cracked the nut open in terms of vastly improving the density of information storage. Those algorithms will be rolled out in the coming year, I think, to increase the amount of information we can store in the system. It’ll improve the performance and allow us to have more indices and sets of things within a given unit of storage space in the database.”
He said that Arcitecta is unlike other data management suppliers: “What differentiates us from others is that we’re in the data path. And I really think real data management must entirely be in the data path. It’s the only way you can do things like understanding where things are in real time, and be able to find things in tens of milliseconds.
“We have a customer where we’re exporting 70 million ordered events per month, and that is enabling us to tell exactly what was created in the file system, by whom and from what vector, what was deleted by whom, and from what vector, what was accessed, every single access to those files at any point in time, every rename, every operation, every metadata operation, every data operation. And then we produce analytics outside of this to determine what the shape of access is in these systems, and how much of that data is being used or not.”
One of Arcitecta’s customers has used this. “They’ve realized that 12 out of their 18 petabytes is not active data at any point. Because we’ve got access records back to 2012, we can tell that those are not active. So they might actually move a lot of that data to tape as well, and just keep a much smaller high-performance storage system than they would normally keep.”
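The analysis Lohrey describes – replaying years of access records to decide which data is cold enough to move to tape – can be sketched in a few lines. This is an illustrative sketch only, not Mediaflux’s actual event schema or API: the tuple format, field names, and cutoff are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: classify files as "cold" when their most recent
# recorded access is older than a cutoff. Event fields are illustrative,
# not Mediaflux's actual schema.
def find_cold_files(access_events, now, cold_after_days=365):
    """access_events: iterable of (path, accessed_at) tuples."""
    last_access = {}
    for path, accessed_at in access_events:
        prev = last_access.get(path)
        if prev is None or accessed_at > prev:
            last_access[path] = accessed_at
    cutoff = now - timedelta(days=cold_after_days)
    return sorted(p for p, t in last_access.items() if t < cutoff)

# A file last touched in 2014 is flagged; a recently accessed one is not.
events = [
    ("/proj/a.dat", datetime(2012, 5, 1)),
    ("/proj/a.dat", datetime(2014, 3, 2)),
    ("/proj/b.dat", datetime(2025, 1, 10)),
]
print(find_cold_files(events, now=datetime(2025, 6, 1)))  # ['/proj/a.dat']
```

With an access history reaching back to 2012, a scan like this is what lets an operator say with confidence that 12 of 18 petabytes can safely sit on tape.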
He said that datacenters need to be built in places where there is energy for compute and water for cooling. This alters system design assumptions. “It used to be that we took compute to the data, and we’re still very interested in that, but, in fact, we might need to really take our data to the compute, where the energy and water is actually.”
This means high-performance networking becomes more important. “To me, that is part of the overall vision that’s driving what we’re doing as a platform, general orchestration. Not just concentrated on a single file system here, high performance parallel file system there, or local enterprise storage, but the ability to move data wherever it’s needed, and have very distributed systems.”
Arcitecta has added more tape library support in the last year, including Spectra Logic, Grau Data, and IBM Diamondback. Lohrey said: “Mediaflux managers will keep track of all the barcodes on tapes in the system, so that if you have an issue where something is corrupted, we can go back and recover it.”
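The barcode tracking Lohrey mentions amounts to a catalog mapping each stored object to the tape cartridges that hold copies of it. The sketch below is purely illustrative – the class name, identifiers, and barcode format are assumptions, not Mediaflux’s real data model.

```python
# Illustrative sketch (not Mediaflux's actual design): a catalog mapping
# each object to the barcodes of tape cartridges holding copies, so a
# corrupted primary copy can be traced back to a recoverable tape.
class TapeCatalog:
    def __init__(self):
        self._copies = {}  # object id -> set of tape barcodes

    def record_copy(self, object_id, barcode):
        """Note that a copy of object_id was written to the tape with this barcode."""
        self._copies.setdefault(object_id, set()).add(barcode)

    def recovery_sources(self, object_id):
        """Barcodes to pull from the library when the live copy is corrupted."""
        return sorted(self._copies.get(object_id, set()))

catalog = TapeCatalog()
catalog.record_copy("asset-42", "AB1234L8")
catalog.record_copy("asset-42", "CD5678L8")
print(catalog.recovery_sources("asset-42"))  # ['AB1234L8', 'CD5678L8']
```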
Looking ahead, Lohrey said: “Where are we going to go to from here?”
Specifically, Arcitecta will add a Python module to Mediaflux, and upgrade the DAMS (Digital Media Asset Management System). It will expand the vector database, and streamline Mediaflux’s deployment.
Lohrey discussed Arcitecta’s general direction for the future, and mentioned two aspects.
First: “We’re going to go further up the stack, which means we’re going to build more applications and things like our digital asset management application. You’ll see more of those that are integrated with the platform. I’ve got a decade’s worth of things that we could build up our sleeve. I’ll probably keep them up my sleeve for a while.”
We wouldn’t be surprised to see AI data pipeline-related applications coming.
Second: “We’re still going to go further down. So that means we’re going to do more storage management underneath, so integrating. … It’ll become less clear where the boundaries are between us and the storage. Because most people are interested in the protocols at the top and the management of their data, and we can hide away a lot of the storage underneath those layers, so we can actually simplify the entire stack just by having us drive the hardware underneath.”
This suggests Arcitecta could be developing a software-defined storage layer.
There will be some sort of action on the HPC front: “I once said that we would not do HPC file systems. And in fact, actually, if you look at something like Dell’s Project Lightning, that’s pretty impressive. When that comes to pass, we’re not going to compete with that. But I think there’s a very good chance that we can do a very significant amount of HPC work. So we’re going to start doing more on that front. So you’ll end up with your grand unified file system.”
Traditionally, HPC data storage has meant parallel file systems, like Storage Scale and Lustre. Arcitecta competitor Hammerspace has used parallel NFS (pNFS) technology to build its data orchestration/data management product. We think Arcitecta might be looking to use pNFS to add parallelism to Mediaflux and so have a foundation for HPC features.