Hathaway Weblog: Bit Mountain
Lately, I've been working on a piece of software designed to manage a hypothetical 20-petabyte (PB) digital archive. I'm calling the software Bit Mountain. I hope to release it as free software under the GPL.
I began researching how to store 20 PB at the beginning of this year. A lot of interesting things happen when you try to build a data store so large that it requires thousands of digital media units, and Bit Mountain is a research project that tries to solve the following problems:
- Periodic verification and replacement are essential. Any high-density digital medium (a tape, a hard drive, a DVD, etc.) fails over time. A tape on a shelf is generally OK, but one hundred tapes without some kind of robot are quite risky.
- Hard drives are easier to verify periodically than tapes. Hard drives are also well-understood commodity items.
- Power is a large concern. 20 PB worth of spinning hard drives would incur a power bill in the neighborhood of $100,000 per month. Over time, that power bill could even exceed the hardware acquisition cost. Unfortunately, RAID-based SANs depend on constant power to maintain data integrity. It's not hard to imagine powering up a SAN after one month of inactivity and discovering that one too many drives have been lost.
- Simple mirroring doubles or triples the amount of digital media required. RAID schemes involving one or two parity drives in a set don't require as much media, but they're also less reliable than mirroring if you shut off the power. Something better is needed.
- On top of all that, it would be such a shame to have to hire 20+ system administrators to watch over a 20 PB archive. That's a low estimate, from what I've heard! So many administrators would be much more expensive than the hardware.
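The power figure in the list above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only illustrative: the drive capacity, the per-drive wattage (including a share of the enclosure and server), the cooling overhead factor, and the electricity price are all assumed inputs, not numbers from this post, and `monthly_power_cost` is a hypothetical helper.

```python
import math

HOURS_PER_MONTH = 730  # average month: 8760 hours per year / 12


def monthly_power_cost(archive_pb, drive_tb, watts_per_drive, overhead, usd_per_kwh):
    """Estimate drive count and monthly power cost for always-spinning drives.

    All parameters are assumptions supplied by the caller:
      archive_pb       usable archive size in petabytes (1 PB = 1000 TB)
      drive_tb         capacity of one drive in terabytes
      watts_per_drive  average draw per drive, including its share of the chassis
      overhead         multiplier for cooling and power distribution
      usd_per_kwh      electricity price
    """
    drives = math.ceil(archive_pb * 1000 / drive_tb)
    kilowatts = drives * watts_per_drive * overhead / 1000
    return drives, kilowatts * HOURS_PER_MONTH * usd_per_kwh


if __name__ == "__main__":
    # Illustrative guesses only: 500 GB drives, 25 W each, 2x overhead, $0.10/kWh.
    drives, cost = monthly_power_cost(20, 0.5, 25, 2.0, 0.10)
    print(f"{drives} drives, about ${cost:,.0f} per month")
```

With these particular guesses the result is 40,000 drives and roughly $146,000 per month, and that is before adding any mirrors or parity drives. Halving the wattage or the overhead factor halves the bill, so the estimate is only as good as the inputs.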
Bit Mountain is a lot like MogileFS. It works at the application level instead of the kernel level, uses a relational database, performs replication automatically, talks to hard drives using simple HTTP, works with any filesystem, and can be configured with no single point of failure. Unlike MogileFS, it also incorporates forward error correction (FEC) in the form of Reed-Solomon encoding. I believe that tunable forward error correction is the key to maintaining integrity on sleeping hard drives, tapes, or optical media. For the same number of tolerated media failures, FEC also consumes less space than replication.
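To make the space claim concrete, the sketch below compares the media needed by plain replication with that of a Reed-Solomon layout. It assumes an ideal code in which a file is split into `data_shards` pieces plus `parity_shards` parity pieces, and any `data_shards` of them suffice to rebuild it. The function names and example parameters are hypothetical and do not describe Bit Mountain's actual settings.

```python
from fractions import Fraction


def replication(copies):
    """Media factor and tolerated losses when storing `copies` full copies."""
    return {"media_factor": Fraction(copies), "losses_tolerated": copies - 1}


def reed_solomon(data_shards, parity_shards):
    """Media factor and tolerated losses for an ideal (k data + m parity) code."""
    total = data_shards + parity_shards
    return {
        "media_factor": Fraction(total, data_shards),
        "losses_tolerated": parity_shards,
    }


if __name__ == "__main__":
    for label, scheme in [
        ("3 full copies", replication(3)),
        ("10 data + 4 parity shards", reed_solomon(10, 4)),
    ]:
        print(
            f"{label}: {float(scheme['media_factor']):.1f}x media, "
            f"survives {scheme['losses_tolerated']} lost units"
        )
```

In this example, three copies need 3.0x the media and survive two losses, while 10 data shards plus 4 parity shards need 1.4x the media and survive four. The price is that reading or repairing a file involves more media units and some computation.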
I've come up with a formula for determining the probability of maintaining data integrity on sleeping hard drives, given the repository size, media reliability, the data protection parameters, and a verification period. I've spent enough time on this post so I'll post the equation later.
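For readers who want a feel for how those four inputs interact, here is a simple binomial model. It is not the formula mentioned above, only a common textbook-style approximation under loudly stated assumptions: drives fail independently, the annual failure rate is constant, each protection set holds `k` data units plus `m` parity units, a set is lost only if more than `m` of its units fail within one verification period, and every failed unit is rebuilt at each verification. All names are hypothetical.

```python
from math import comb


def per_period_failure(annual_failure_rate, period_days):
    """Chance that one drive fails within a verification period (assumed constant rate)."""
    return 1 - (1 - annual_failure_rate) ** (period_days / 365)


def set_survival(k, m, p):
    """Chance that at most m of the k + m units in one set fail, each with probability p."""
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m + 1))


def archive_survival(sets, k, m, p, periods=1):
    """Chance that every set survives every period, assuming full repair between periods."""
    return set_survival(k, m, p) ** (sets * periods)


if __name__ == "__main__":
    # Illustrative inputs only: 5% annual failure rate, verification every 30 days,
    # 4000 sets of 10 data + 4 parity units, evaluated over 12 periods.
    p = per_period_failure(0.05, 30)
    print(f"per-drive failure chance per period: {p:.5f}")
    print(f"archive survival over 12 periods:    {archive_survival(4000, 10, 4, p, 12):.6f}")
```

Shortening the verification period lowers `p`, and raising `m` adds terms to the sum, so both push the survival probability toward 1. The model ignores correlated failures such as a bad drive batch or a failed power-up after long inactivity, which is exactly the kind of risk a sleeping archive has to worry about.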
