Opinion: Online disk archives are just wrong

A question: what’s the difference between nearline disk storage and an active archive system that uses only disk drives? The answer is none.

The Cambridge Dictionary defines the word archive thus: “A computer file used to store electronic information or documents that you no longer need to use regularly.”

In that case it no longer needs to be stored on disk drives offering continuous access.

Active Archive Alliance

The Active Archive Alliance (AAA) defines an active archive this way: “Active archives enable reliable, online and cost-effective access to data throughout its life and are compatible with flash, disk, tape, or cloud as well as file, block or object storage systems. They help move data to the appropriate storage tiers to minimize cost while maintaining ease of user accessibility… Creating an active archive is a way to offload Tier 1 storage and free up valuable space on expensive primary storage and still store all of an organization’s data online.”

In other words, an active archive covers non-primary data, meaning secondary (nearline) and tertiary (offline) data with no mention of online media being restricted to caching. The AAA is saying you can have an online archive.

Its version of the four-tier storage model omits the media types from all the tiers, and contains a deep archive sub-class:

Active Archive Alliance 4-tier storage model as shown in the Storage Newsletter

This opens the door to online archives, such as products from disk drive maker and AAA member/sponsor Seagate.

Seagate Enterprise Archive Systems

Seagate describes enterprise data archives as “storage systems or platforms for storing organizational data that are rarely used or accessed, but are nevertheless important. This may include financial records, internal communications, blueprints, designs, memos, meeting notes, customer information, and other files that the organization may need later.”

The “early enterprise data archives were mostly paper records kept in designated storage units… More recently, organizations are moving their data archives to cloud-based solutions. Cloud-based solutions make data archives more accessible and reduce the associated costs.” 

Cloud-based solutions include on-premises, disk-based object storage systems that run Cloudian, Scality or other object storage software on Seagate Exos disk drive enclosures, as well as Seagate’s Lyve Cloud managed disk array service.

There is no concept of the disks caching data in front of a library of offline tape or optical disk cartridges here. Analyst Fred Moore of Horison Information Strategies has a different view.

Horison view of archive

Moore explains what an archive means in a “Building the Archive of the Future” paper sponsored by Quantum. Unlike a backup, which is a copy of data made so that it can be restored if the original is lost or damaged, an archive is a version of the original data from which parts can be retrieved, not restored.

This definition, with the restore vs retrieval keystone, is the one used by W Curtis Preston in his Modern Data Protection book published in 2021.

Modern Data Protection by W Curtis Preston talks about archive storage

Moving data to archival storage frees up capacity on the primary storage location and takes advantage of cheaper and higher-capacity long-term storage with slower access times, such as tape or optical disk. Moore says there are two kinds of archive: an active archive composed of offline tape and online disk drives, and a longer-term or deep archive composed of offline storage only.

An archive is also defined by its use of specific software: object storage software that scales out and geo-spreads unstructured and object data to manage and protect archival storage needs. It includes smart data movers, data classification and metadata capabilities.

Moore says: “A commonly stated objective for many data center managers today is that ‘if data isn’t used, it shouldn’t consume energy.’” This clearly places tape as the greenest storage solution available. He suggests: “Between 60 and 80 percent of all data is archival and much of it is stored in the wrong place, on HDDs and totals 4.5-6ZB of stored archival data by 2025 making archive the largest classification category.” Note that thought: “Stored in the wrong place… on HDDs.”
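As a rough sanity check on Moore’s figures, the quoted ranges can be turned into an implied total for all stored data. The sketch below is not taken from Moore’s paper. It assumes the 60 percent share pairs with the 4.5ZB figure and the 80 percent share with the 6ZB figure, and the function name is hypothetical.

```python
def implied_total_zb(archival_zb: float, archival_percent: int) -> float:
    """Return the total stored data (in ZB) implied by an archival
    volume and the percentage of all data that it represents."""
    if not 0 < archival_percent <= 100:
        raise ValueError("archival_percent must be in the range (0, 100]")
    return archival_zb * 100 / archival_percent


# Assumed pairing of the quoted ranges: low end with low end, high with high.
low = implied_total_zb(4.5, 60)
high = implied_total_zb(6.0, 80)
print(f"Implied total stored data: {low:.1f}ZB to {high:.1f}ZB")
```

Under that pairing both ends of the range imply the same total, about 7.5ZB, which suggests the percentage and zettabyte ranges come from a single estimate of all stored data.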

Fred Moore’s four-tier storage diagram

His point is clear: disk storage is the wrong medium for an archive. What role then does disk play in the active archive tier? Moore says: “An active archive implementation provides faster access to archival data by using HDDs or SSDs as a cache front-end for a robotic tape library. The larger the archive becomes, the more benefit an active archive provides.”

In Moore’s view, online media, disk or NAND, is a cache in front of a tape library, not a storage archive tier in its own right. That’s quite different from the Active Archive Alliance viewpoint.

Online archives and nearline storage

The AAA’s active archive definition is confusing as it includes both online and offline media. For Moore, an archive is inherently offline.

An archive in the traditional sense should not include storage systems using constantly moving media, such as disk or tape; they use too much electricity, and archive data generally does not need to be continuously available. An archive should be based on offline media only, with a front-end online cache for active archives.

To my mind there needs to be a strong distinction between offline and online archive media because the energy consumption and access characteristics are so different. Letting online disk into the same category of system as offline media is like letting a carbon-emitting fox into an environmentally green hen house. Calling disk-based storage systems active archives is a misnomer. They should be regarded as nearline object storage systems.

Some Active Archive Alliance members appear to agree. In an August 2022 blog post, IBM’s Shawn Brume, Tape Evangelist and Strategist, said: “In a study conducted by IBM in 2022 that utilized publicly available data, a comparison of large-scale digital data storage deployments demonstrated that a large scale 10 petabyte Open Compute Project (OCP) Bryce Canyon HDD storage had 5.1 times greater CO2e impact than a comparable enterprise tape storage solution.”

Brume blog graphic. Tape is far more environmentally friendly than disk

“This was based on a ten-year data retention lifecycle using modern storage methodologies. The energy consumption of HDD over the life cycle along with the need to refresh the entire environment at Year 5 drives a significant portion of CO2 emissions. While the embedded carbon footprint is 93 percent lower with tape infrastructure compared to the HDD infrastructure.”

Brume’s blog goes on to include the AAA’s four-class tiered storage diagram, which distinguishes active archives from archives; the latter contain the deep archive sub-class.

Seagate and spin-down

You could theoretically have a disk-based archive system if it used spin-down disks. This was tried by Copan with its MAID (Massive Array of Idle Disks) design back in the 2002-2009 period, and revisited by SGI in 2010. It’s not been successful, though.

Disk drive manufacturer Seagate actually produces spin-down disk systems. Its Lyve Mobile array is a “portable, rackable solution [that] easily integrates into any data management workflow. Get versatile, high-capacity and high-performance data transfers. With industry-standard AES 256-bit hardware encryption and key management in a rugged, lockable transport case.” The disk drives are not spinning while the case is being transported.

In theory, then, Seagate could develop a spin-down Exos or Corvault disk enclosure, and its attempts to present itself as lowering the lifetime carbon emissions of its products would then have more substance.