
Analysis

Combatting silent SATA drive errors

posted on 10 July 2008 08:06


Infortrend says use drives with ECC checks and switch on controller features

Are some RAID controllers not carrying out read integrity checks on SATA drives when they should? Blocks and Files recently carried a news story about RAID Inc. OEM'ing NEC's D-Series arrays for its government and high-performance computing (HPC) customers because these were the only controllers that detected and prevented the silent drive errors and read integrity failures that some of its customers had noted with other products.

Jerome Wendt of DCIG also wrote on this issue: "(Bob) Picardi (of RAID Inc.) indicated that they were getting some very disturbing feedback from their existing HPC customers that used SATA-based storage systems. In these environments, some of their clients were running the same query against the same data set and coming back with different answers. This was not a one-time occurrence but occurring frequently enough that they felt the need to change out their storage systems. Their clients then began internal procedures to double-check their answers as well as checked with their colleagues at other HPC locations and found that they too were having similar problems with SATA-based storage systems."

Picardi identified Infortrend and Xyratex controllers as falling short on read integrity checking, saying, "RAID Inc does hold OEM partnerships with both Infortrend and Xyratex and ships rebranded products to our customers today. However, when working with customers who have extreme amounts of data and the critical nature of their work demands the highest level of data integrity, we turn to NEC’s D-series to address that."

On this topic of silent SATA drive errors and read integrity checking, Alex Young, director of technical marketing at Infortrend Europe, writes:

'Each element in the RAID configurations is key as they all contribute to the final result of data integrity, reliability, performance and data protection.'

'According to the three major disk drive manufacturers, each disk drive has carried out an ECC verification when reading the data. The ECC verification code was written to the disk platter together with the data, when writing to the disk. The ECC verification on read is how the disk drive detects and reports bad blocks to the RAID controller. If one particular disk drive model does not have the ability to detect the data integrity and report the errors to the RAID controller, that particular disk drive model or drive firmware version should not be used in the RAID at all.'

'There are disk drive models that should not be used in RAID, for example, the disk drives models that have been specially tuned for consumer video recorders. In some of the specially tuned disk drive models, the disk drive may skip the verifications on read. Customers should avoid using disk drives that do not provide the ability to detect their own data integrity, or disk drive models that were not qualified by the RAID vendor.'

'Adding an additional layer of data integrity check when reading from the disk drives does provide additional protection against the data error from the disk drives. In order to do that, additional space for the ECC will be required for each data block on the disk drive. That means the industrial standard 512-byte-per-block disk drives can no longer be used, instead the disk drives will have to be pre-formatted into special pre-defined format (maybe 520 or more bytes per block) in order to accommodate the additional ECC. The customer can only purchase the disk drives from this specific vendor (with possible additional cost), losing the freedom of choosing their preferred disk drive models or vendors on the market. Not to mention the additional impact to the read performance, which additional performance bottleneck was generated by the additional data integrity checks.'
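To make the 512-versus-520-byte trade-off concrete, here is a minimal Python sketch of a sector that carries its own integrity field. It is an illustration only, not Infortrend's or NEC's actual on-disk format: the 8-byte trailer layout (a CRC-32 plus the block address), and the names `pack_sector`, `unpack_sector` and `space_overhead`, are assumptions made for this example.

```python
import struct
import zlib

DATA_BYTES = 512       # standard sector payload
TRAILER_BYTES = 8      # assumed extra integrity field (hypothetical layout)
SECTOR_BYTES = DATA_BYTES + TRAILER_BYTES  # 520 bytes on disk


def pack_sector(lba: int, data: bytes) -> bytes:
    """Append a CRC-32 of the payload and the block address to a 512-byte payload."""
    if len(data) != DATA_BYTES:
        raise ValueError("payload must be exactly 512 bytes")
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return data + struct.pack(">II", crc, lba & 0xFFFFFFFF)


def unpack_sector(lba: int, sector: bytes) -> bytes:
    """Verify the trailer on read and return the payload, or raise on a mismatch."""
    if len(sector) != SECTOR_BYTES:
        raise ValueError("sector must be exactly 520 bytes")
    data, trailer = sector[:DATA_BYTES], sector[DATA_BYTES:]
    stored_crc, stored_lba = struct.unpack(">II", trailer)
    if stored_lba != (lba & 0xFFFFFFFF):
        raise IOError("block address mismatch: data came from the wrong location")
    if (zlib.crc32(data) & 0xFFFFFFFF) != stored_crc:
        raise IOError("checksum mismatch: payload was silently corrupted")
    return data


def space_overhead() -> float:
    """Fraction of extra space the trailer costs relative to the payload."""
    return TRAILER_BYTES / DATA_BYTES
```

Under these assumptions the extra field costs 8/512, or about 1.6 percent, of capacity, and every read pays for one additional checksum computation, which is the performance cost the statement above refers to.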

'By using the qualified disk drive models and drive firmware versions, the tight quality control from today's major disk drive manufacturers can ensure the data integrity from the disk drives. From the RAID parity point of view, in many RAID models (for example Infortrend RAID), there are features that can be enabled by the end user, to verify the RAID parity, or ECC on the disk drives, constantly at the background, and/or to verify the ECC when writing to the disk drives. All these can ensure that the data integrity of the information stored on the RAID is secured.'
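The background parity verification mentioned above can be pictured with a short Python sketch. It assumes a single-parity (RAID 5 style) layout in which the parity block is the byte-wise XOR of the data blocks in a stripe; the function names are hypothetical and this is not a description of how any Infortrend controller implements the feature.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks; this is the parity of the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """True if the stored parity matches the parity recomputed from the data."""
    return xor_blocks(data_blocks) == parity_block


def scrub(stripes):
    """Return the indices of stripes whose stored parity does not match.

    `stripes` is a list of (data_blocks, parity_block) pairs.
    """
    return [
        index
        for index, (data_blocks, parity_block) in enumerate(stripes)
        if not stripe_is_consistent(data_blocks, parity_block)
    ]
```

A scrub of this kind reveals that a stripe is inconsistent, but with single parity alone it cannot say which block in the stripe is the bad one. That is why the drive's own ECC error reporting, or a per-block integrity field, still matters.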

[Chris Mellor.]



