
Shane :: Python, Software :: October 26, 2005 # Computing Storage Safety

I've written a small Python script that uses the equation I derived for computing digital storage safety. You can download it from my sketches page. The equation is explained in the comments. I believe the computation is correct, and it behaves just the way I expect. Although I'm focusing mainly on hard drive storage, the computation can also apply to any digital storage involving multiple media, such as optical disks or tapes.

Here is what the script outputs:

media unit survival probability after 1 year with MTBF=1000000: 0.991273
media unit survival probability after 1 month with MTBF=1000000: 0.999268
survival probability after 1 month, based on observation (faked for now): 0.98
unprotected volume (can lose 0): 0.98
RAID 0 with 2 units (can lose 0): 0.9604
RAID 0 with 4 units (can lose 0): 0.92236816
RAID 1 with 2 units (can lose 1): 0.9996
RAID 1 with 3 units (can lose 2): 0.999992
RAID 5 with 8 units (can lose 1): 0.989663107901
RAID 5 with 4 units (can lose 1): 0.99766352
RAID 6 with 8 units (can lose 2): 0.999584542567
RAID 6 with 4 units (can lose 2): 0.99996848
FEC, 4 data, 3 protection (can lose 3): 0.999994664346
FEC, 4 data, 4 protection (can lose 4): 0.999999829607
FEC, 12 data, 3 protection (can lose 3): 0.999816994311
FEC, 20 data, 10 protection (can lose 10): 0.999999999992
FEC, 20 data, 20 protection (can lose 20): 1.0

The first two lines are based on the MTBF numbers published by hard drive manufacturers (MTBF is quoted in hours). If MTBF dominates the reliability equation (a lot of people dispute that assumption; more on that in a moment), there is a 99.13% chance that one hard drive will last one year. One year is too long to wait between verification runs, though; a period of one month is more reasonable. The chance that one hard drive survives one month is thus 99.93%.
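As a rough sketch of where numbers like these can come from, assume a constant failure rate (an exponential lifetime model) and an MTBF given in hours. Both are assumptions of this sketch, as are the hours-per-year and hours-per-month constants; the script's own formula may differ slightly. The function name `survival_from_mtbf` is hypothetical.

```python
import math

# Assumptions for illustration only: an average calendar year,
# and a month taken as one twelfth of it.
HOURS_PER_YEAR = 365.25 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12


def survival_from_mtbf(mtbf_hours, period_hours):
    """Probability that one unit survives the period, assuming a
    constant failure rate of 1 / mtbf_hours (exponential model)."""
    return math.exp(-period_hours / mtbf_hours)


year = survival_from_mtbf(1_000_000, HOURS_PER_YEAR)
month = survival_from_mtbf(1_000_000, HOURS_PER_MONTH)
print(f"1 year:  {year:.4f}")
print(f"1 month: {month:.4f}")
```

Under these assumptions the results land very close to the first two lines of the script output.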

More about verification: periodically, the media must go through a verification process to detect lost replicas and chunks. If all data is recovered and all lost replicas are replaced a short time after verification, all periods have a similar data loss risk. Thus you can base the reliability estimate on the verification period rather than the expected media lifetime. This is a life-saver for storage area network (SAN) vendors, since they perform very frequent verification, resulting in fairly high reliability.

However, hard drive life is complicated by many factors other than MTBF. It's better to measure hard drive reliability by buying 1000 drives and counting how many survive the period between verification runs. That's expensive, though, so I made a wild guess that 98% of hard drives survive a period of one month between verification runs. I expect that estimate to be pessimistic. The third and fourth lines of the script output show my wild guess. (Does anyone have a real estimate? I haven't found anything.)

The rest of the output is based on the estimated reliability of one media unit, but even if the estimate is wrong, the script is still useful for comparing the reliability of different storage configurations. RAID 0, which has no redundancy, causes reliability to drop. RAID 1 (mirroring) raises reliability at the expense of storage size. RAID 5 is less reliable than RAID 1, as expected.
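The "can lose N" figures are consistent with a simple binomial model: a stripe survives if no more than N of its units fail during the period. The following is a minimal sketch of that idea, not the script itself. It assumes unit failures are independent and that every unit has the same survival probability; the function name `stripe_survival` is hypothetical.

```python
from math import comb


def stripe_survival(p, units, can_lose):
    """Probability that at most `can_lose` of `units` fail, where each
    unit independently survives the period with probability `p`."""
    q = 1.0 - p
    return sum(
        comb(units, k) * q**k * p ** (units - k)
        for k in range(can_lose + 1)
    )


p = 0.98  # the guessed one-month survival probability of one unit
configs = [
    ("RAID 0 with 2 units", 2, 0),
    ("RAID 1 with 2 units", 2, 1),
    ("RAID 5 with 4 units", 4, 1),
    ("RAID 6 with 4 units", 4, 2),
]
for label, units, can_lose in configs:
    result = stripe_survival(p, units, can_lose)
    print(f"{label} (can lose {can_lose}): {result:.8f}")
```

Changing `p` shifts every result, but the ranking of the configurations is what matters for comparison.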

There's a new module in the Linux kernel that implements RAID 6. I'm not sure the meaning of RAID 6 is standardized in the storage industry, but the module says you can lose any two drives in a RAID 6 configuration. If it's for real, RAID 6 is an excellent middle ground between RAID 1 and RAID 5.

Now, we're talking about a lot of 9's, but the discussion so far has been about only one stripe. (I'm using the term stripe to refer to a set of protected media units, including both data and protection bytes.) If there are two stripes, the total reliability is the reliability of one stripe multiplied by the reliability of the other stripe. To compute the reliability of a petabyte built on 1 TB stripes, take the reliability of one stripe and raise it to the 1000th power.

Let's say I have 1000 boxes, each holding four 320 GB drives in a RAID 5 configuration. That's 960 TB of usable (though unmanaged) space. The script I wrote computes that each box has a reliability of 0.99766352, and 0.99766352 ** 1000 is about 0.0964. Yikes! If the figures are right, the chance of retaining all of the data from month to month is less than 10%. Frequent verification helps a lot, but it still doesn't make me feel comfortable enough to build a multi-petabyte storage system, since a 320 GB drive is so large that it takes hours just to scan the surface, and that's without seeking. The verification time can only be so short.
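The arithmetic for the 1000-box example can be checked in a few lines. This sketch assumes the boxes fail independently of each other; the variable names are only for illustration.

```python
# Reliability of one RAID 5 box with 4 units, taken from the script output.
stripe = 0.99766352
boxes = 1000
drives_per_box = 4
drive_gb = 320

# RAID 5 gives up one drive's worth of space per box to parity.
usable_tb = boxes * (drives_per_box - 1) * drive_gb / 1000

# All boxes must survive, so the per-box reliabilities multiply.
all_survive = stripe**boxes

print(f"usable space: {usable_tb:.0f} TB")
print(f"chance that all {boxes} boxes keep their data: {all_survive:.4f}")
```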

That's why all of those 9's matter. When combining thousands of drives, the reliability of a single stripe has to be extremely high to overcome the effect of the large exponent. Fortunately, forward error correction like the strategy implemented in DIBS and the Reed-Solomon Python extension makes it possible to build such a high level of reliability that the floating point calculation rounds the estimate to 1.0. Also note that the 20:10 FEC configuration stores only half a protection byte for every data byte, yet its reliability ends up much higher than mirroring, which stores a full protection byte for every data byte.
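When an estimate rounds to 1.0, one way to keep the detail is to compute the probability of loss directly instead of the probability of survival. The sketch below does that under the same assumptions as before (independent failures, a guessed per-unit survival of 0.98 per period) and compares mirroring with two FEC layouts across 1000 stripes. The function name `stripe_loss` is hypothetical.

```python
import math


def stripe_loss(p, units, can_lose):
    """Probability that more than `can_lose` of `units` fail, where each
    unit independently survives the period with probability `p`."""
    q = 1.0 - p
    return sum(
        math.comb(units, k) * q**k * p ** (units - k)
        for k in range(can_lose + 1, units + 1)
    )


p = 0.98
stripes = 1000
layouts = [
    ("mirror, 1 data, 1 protection", 1, 1),
    ("FEC, 20 data, 10 protection", 20, 10),
    ("FEC, 20 data, 20 protection", 20, 20),
]
for label, data, protection in layouts:
    loss = stripe_loss(p, data + protection, protection)
    # 1 - (1 - loss) ** stripes, written so tiny losses do not round to zero.
    total_loss = -math.expm1(stripes * math.log1p(-loss))
    print(f"{label}: per-stripe loss {loss:.3e}, "
          f"loss across {stripes} stripes {total_loss:.3e}")
```

Working with the loss side keeps very small probabilities visible, so configurations that all print as 1.0 can still be told apart.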

That's the math. Of course, even the best storage plans can be thwarted by fire, theft, natural disasters, or an errant system administrator.

Comments

Sascha Welter (November 16, 2005 07:00)

This reminds me of a company where I administered a file server with a hardware RAID 5 and one hot spare. One morning I came in and we had 2 failed disks at once. The service guy could not believe it. We had recent tape backups (that was at a time when tapes could still almost keep up with HD capacities) and one of the drives was still readable, so the damage was not too big. But getting it all running again took a long time and wasted a lot of work time.

Shane Hathaway (November 16, 2005 10:01)

Yep, it happens. Even if the chance of that happening is 0.1% over a given time period, if you have many time periods or thousands of RAID 5 arrays, the chance of failure is actually quite high.

RAID in general is naive for data you want to preserve for decades. The storage vendors like to quote four or five "nines" of assurance against data loss. I think they mean each RAID array has a reliability level of 99.99% or 99.999% over the period of one day. (They never say how they computed that or what their confidence level is... they need statisticians!) If you have 1000 four-nines arrays, the probability of retaining all of the data is 90.5%. For five nines it's 99%. That is not comfortable at all.
