
Analysis

An Analysis of Latent Sector Errors in Disk Drives

posted on 19 February 2008 09:24


NetApp and Wisconsin university study

A University of Wisconsin-Madison researcher and three NetApp employees have published a research paper on latent sector errors (LSEs) in disk drives. Of the disks surveyed, 3.45 percent developed such errors during the study period.

LSEs are errors in a disk sector that go undetected until the sector is next accessed, by which time the data in that sector is unrecoverable: the sector cannot be read or written, or there is an unrecoverable ECC error. The authors write 'Such errors can usually be repaired by rewriting the data to a spare sector without having to replace the entire disk drive.'

They also note 'that a single latent sector error can lead to data loss during RAID group reconstruction after a disk failure.'
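
To illustrate why one unreadable sector matters during a rebuild, here is a minimal sketch assuming a single-parity XOR layout (RAID 4/5 style). The paper excerpt does not say which RAID scheme is involved, and double-parity schemes can survive this case, so treat the layout and the helper names `xor_blocks` and `rebuild_block` as illustrative assumptions only.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(survivors: Sequence[Optional[bytes]]) -> Optional[bytes]:
    """Rebuild the failed disk's block for one stripe.

    `survivors` holds the blocks read from every remaining disk (data and
    parity). None marks a block that hit a latent sector error. With single
    parity, any unreadable survivor makes the stripe unrecoverable.
    """
    readable: List[bytes] = [b for b in survivors if b is not None]
    if len(readable) != len(survivors):
        return None  # a second loss in the same stripe: nothing to XOR against
    return xor_blocks(readable)


# One stripe across three data disks plus one parity disk (hypothetical data).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# The disk holding d1 fails. If every survivor is readable, d1 comes back.
print(rebuild_block([d0, d2, parity]) == d1)

# Same failure, but the sector holding d2 has a latent error: rebuild fails.
print(rebuild_block([d0, None, parity]))
```

The second call shows the situation the authors warn about: the disk failure alone was survivable, but the latent error on a surviving disk removes the redundancy needed to recompute the missing block.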

The researchers looked at LSE occurrence rates in a population of 1.53 million nearline and enterprise disk drives, in more than 50,000 arrays in production use at customer sites, over a period of 32 months, making it a large-scale study. Three of the four authors work for NetApp, and a note in the paper tells us that these are drives in NetApp filers running Data ONTAP.

In the study the nearline disks were serial ATA (SATA) drives and the enterprise disks were Fibre Channel drives.

What did they find?

- 3.45% of the drives developed LSEs.
- Nearline disks were more likely to suffer them than enterprise disks.
- The LSE rate increases over time: linearly for enterprise disks, and faster than linearly for nearline disks. There is a sharp rise in the LSE rate for nearline disks in their second year of life.
- The LSE rate increases as disk capacity increases.
- LSEs tend to be close together in terms of disk location and in time.
- Once a disk has developed an LSE it is likely to develop more.
- Disk scrubbing can detect more than 60% of LSEs.

Disk Scrubbing

The authors describe scrubbing thus:-

"Our storage system periodically scrubs all disks as a proactive measure to detect latent sector errors and corruption errors. Two types of scrubs are performed – media scrubs and data scrubs."

"Media scrubs use a SCSI Verify command to validate a disk sector’s integrity. This command performs an ECC check of the sector’s content from within the disk without transferring data to the storage layer. On failure, the command returns a latent sector error. The storage layer performs media scrubs continuously in the background, with the rate of scrub adjusted so as not to impact foreground performance. Media scrubs typically complete within 2 weeks."

"A data scrub is primarily used to detect data corruption. This scrub issues read operations for each disk sector, computes a checksum over its data, compares the checksum to the on-disk 8-byte checksum, and reconstructs the sector from other disks in the RAID group if the checksum comparison fails. Latent sector errors discovered by data scrubs appear as read errors."

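The data scrub loop described in that last quote can be sketched roughly as follows. This is not NetApp's implementation: the checksum algorithm is not named in the quote, so an 8-byte BLAKE2b digest stands in for the on-disk 8-byte checksum, and the in-memory dictionaries and the `reconstruct` callback (standing in for a rebuild from the other disks in the RAID group) are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def checksum8(data: bytes) -> bytes:
    """Stand-in 8-byte checksum (assumption: the real algorithm is not stated)."""
    return hashlib.blake2b(data, digest_size=8).digest()


def data_scrub(
    sectors: Dict[int, bytes],
    stored_checksums: Dict[int, bytes],
    reconstruct: Callable[[int], bytes],
) -> List[int]:
    """Read every sector, verify its checksum, and repair mismatches in place.

    Returns the numbers of the sectors that had to be reconstructed.
    """
    repaired: List[int] = []
    for number in sorted(sectors):
        if checksum8(sectors[number]) != stored_checksums[number]:
            # Checksum mismatch: replace the sector with the copy
            # rebuilt from redundancy elsewhere.
            sectors[number] = reconstruct(number)
            repaired.append(number)
    return repaired


# Hypothetical three-sector disk with one silently corrupted sector.
good = {0: b"alpha", 1: b"beta", 2: b"gamma"}
stored = {n: checksum8(d) for n, d in good.items()}
disk = dict(good)
disk[1] = b"bXta"

print(data_scrub(disk, stored, good.__getitem__))
```

A media scrub differs in that the check happens inside the drive, so no sector data crosses to the storage layer; the sketch above covers only the data scrub path.
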
They recommend its use:-

"Given that only 3.45% of disks in our study ever developed a latent sector error, we believe that detecting errors through a low priority background scrubbing process is sufficient. Indeed, our data shows that over 60% of all latent sector errors are discovered by the media scrubbing process, which scans the entire surface of the media at least once every two weeks. This convinces us that media scrubbing is effective in discovering problems that may result in data loss."

Another suggestion is to spread crucial file system information across a disk's surface, so that a cluster of closely located LSEs is less likely to cause drastic file system errors.

Read the article in PDF form by downloading it from here.
