<?xml version="1.0"?>
<rss version="2.0">

<channel>
<title>Hathaway Weblog</title>
<link>http://hathawaymix.org</link>
<description>Flux-capacitor... fluxing...</description>
<language>en-us</language>
<copyright>Copyright 2004-2005</copyright>
<lastBuildDate>Wed, 10 Dec 2008 14:46:41 -0700</lastBuildDate>
<docs>http://blogs.law.harvard.edu/tech/rss</docs> 

<item>
<title>New Weblog</title>
<link>http://hathawaymix.org/Weblog/2008-12-10</link>
<description>&lt;p&gt;I have moved to a new WordPress-based weblog at &lt;a class="reference" href="http://shane.willowrise.com/"&gt;http://shane.willowrise.com/&lt;/a&gt;.  Please update your RSS feeds.  It took me a long time to decide to move away from the Zope software behind this blog, but I decided the features WordPress gives me are too important to ignore.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2008-12-10</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Weblog</category>
<pubDate>Wed, 10 Dec 2008 14:46:41 -0700</pubDate>
</item>
<item>
<title>Transparent Async I/O in Python</title>
<link>http://hathawaymix.org/Weblog/2006-05-17-01</link>
<description>&lt;p&gt;I dream of being able to write I/O code in Python in the synchronous, blocking style, then switching the code to asynchronous and non-blocking with minimal changes.  Synchronous I/O is easier to write, but asynchronous avoids the need for threads and is often faster.  Can it be done?  The challenge is this: when the software is about to do some blocking I/O operation, you have to convert the operation to a non-blocking operation (fairly easy), unwind the stack (probably not hard), transfer control to an event loop (easy), and restore the stack when the I/O finishes (hard).&lt;/p&gt;
&lt;p&gt;Stackless Python provides a way to unwind and restore stacks.  Stackless is the easiest way to find out whether the idea is a good one.  However, Stackless is not part of core Python and may never be, so even if the idea turns out to be successful, not many people will use it.&lt;/p&gt;
&lt;p&gt;The new coroutines in Python 2.5 might help if you code everything that does any I/O as a generator and simulate a stack using a stack of generators.  Unfortunately, I think I'd have to write a ton of &amp;quot;yield&amp;quot; statements in that case.  I wouldn't be allowed to call any function that does any I/O; I'd have to ask the simulated stack to call the function.&lt;/p&gt;
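&lt;p&gt;To make the simulated stack concrete, here is a minimal sketch of a trampoline.  It is written for modern Python 3, where a generator can return a value (Python 2.5 generators could not), and the names trampoline, read_line, greet, and fake_io are made up for illustration.  The loop stands in for an event loop: a real one would wait for the I/O to finish instead of answering right away.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
import types


def trampoline(root, io_handler):
    """Run a generator, keeping the simulated call stack in a list."""
    stack = [root]
    value = None
    while stack:
        try:
            result = stack[-1].send(value)
        except StopIteration as stop:
            # The innermost "function" returned; pass its value to the caller.
            stack.pop()
            value = stop.value
            continue
        if isinstance(result, types.GeneratorType):
            # A yielded generator is a "call": push it and start it with None.
            stack.append(result)
            value = None
        else:
            # Anything else is an I/O request (a hypothetical tuple here).
            value = io_handler(result)
    return value


def read_line(source):
    line = yield ("read", source)   # hand the request to the loop
    return line.strip()


def greet(source):
    name = yield read_line(source)  # "call" through the simulated stack
    return "hello " + name


def fake_io(request):
    # Stand-in for real non-blocking I/O: answer immediately.
    operation, source = request
    return source + "\n"


print(trampoline(greet("world"), fake_io))  # hello world
```
&lt;/pre&gt;
&lt;p&gt;Note that every function on the path to the I/O has to be a generator, and every call has to go through a yield.&lt;/p&gt;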
&lt;p&gt;The coroutine-based solution might work, though, if there's a decorator that can turn a function into an I/O aware coroutine.  This decorator would turn calls to blocking functions into yield statements that transfer control back to the event loop.&lt;/p&gt;
&lt;p&gt;I wonder if anyone is trying to do this already.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-05-17-01</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Python</category>
<pubDate>Wed, 17 May 2006 01:54:38 -0600</pubDate>
</item>
<item>
<title>Aching to Unveil Bit Mountain</title>
<link>http://hathawaymix.org/Weblog/2006-05-17</link>
<description>&lt;p&gt;Bit Mountain is a digital archiving system I've been building.  I've &lt;a class="reference" href="http://hathawaymix.org/Weblog/2005-10-13"&gt;mentioned it before&lt;/a&gt;.  I'm working to make it scale from a few small files on a few hard drives to billions of files (consisting of multiple petabytes) on thousands of media.  I believe I've come to understand the storage problems that occur at that scale and have coded many of the solutions in Bit Mountain.&lt;/p&gt;
&lt;p&gt;I recently presented a paper about Bit Mountain.  You can download the slides from the &lt;a class="reference" href="http://www.fht.byu.edu/prev_workshops/workshop06/"&gt;Family History Technology Workshop 2006&lt;/a&gt; archive.  I'll also post the paper I presented if anyone requests it.&lt;/p&gt;
&lt;p&gt;I believe I just found the solution to a nagging problem in Bit Mountain.  Bit Mountain currently creates 3-4 relational database rows for every file in the archive.  This is no problem if there are millions of files, but it's an ugly problem if there are billions of files.  Databases typically don't support billions of rows without big hardware and deep magic tuning.&lt;/p&gt;
&lt;p&gt;The obvious solution is a distributed hash table, but that would require a complete rewrite of the software, and a DHT does not take advantage of natural file groupings.  Many of the images represent pages of books, so it would make sense to keep book pages together in the archive.  Keeping pages together means that when you flip pages in a digital book, loading a page is likely to transparently pre-fetch information about the other pages, and turning pages will be quick.&lt;/p&gt;
&lt;p&gt;What I decided to try is uploading bundles of files rather than individual files.  The database will record only 3-4 rows for each bundle.  To read an individual file, clients will fetch it from the bundle that contains it.  Bundles are encoded as multipart MIME messages and can span gigabytes.  For each bundle, there is a metadata file that holds an index of the files it contains.  If bundles contain an average of 1000 files, then the billion file problem suddenly becomes a mere million file problem and life is easy again.  Yesterday and today, I worked out the code for uploading bundles.  Now I need to work on the download portion.  I'm optimistic that it will perform well.&lt;/p&gt;
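&lt;p&gt;As a rough illustration of the bundle idea (not Bit Mountain's actual code), here is a small Python sketch that packs files into a multipart MIME message with the standard email package and keeps a separate index.  The function names and the index layout (file name to part position) are assumptions made for the example.  It parses the whole bundle in memory, which a gigabyte-sized bundle would need to avoid.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def build_bundle(files):
    """Pack a dict of {name: bytes} into one MIME message plus an index."""
    bundle = MIMEMultipart("mixed")
    index = {}
    for position, (name, data) in enumerate(files.items()):
        part = MIMEApplication(data)  # base64-encoded octet-stream part
        part.add_header("Content-Disposition", "attachment", filename=name)
        bundle.attach(part)
        index[name] = position        # hypothetical index: name to part number
    return bundle.as_bytes(), index


def read_file(bundle_bytes, index, name):
    """Fetch one file from a bundle using the index."""
    message = message_from_bytes(bundle_bytes)
    part = message.get_payload()[index[name]]
    return part.get_payload(decode=True)


pages = {"page1.tif": b"first page", "page2.tif": b"second page"}
bundle_bytes, index = build_bundle(pages)
print(read_file(bundle_bytes, index, "page2.tif"))
```
&lt;/pre&gt;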
&lt;p&gt;Everyone I talk to at work is in favor of releasing Bit Mountain as open source software, but I can't do it without official approval.  The idea of releasing it keeps stalling for some reason.  I &lt;em&gt;ache&lt;/em&gt; to release it.  I want feedback, I want people to try it out, I want people to improve upon it.  I want people to test my theory that Bit Mountain can provide higher reliability and better storage utilization than what is achievable through replication.  I want as many people as possible to run it, harden it, and prove that it's reliable.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-05-17</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Free Software</category>
<pubDate>Wed, 17 May 2006 00:16:21 -0600</pubDate>
</item>
<item>
<title>Gentoo Tip: Screen</title>
<link>http://hathawaymix.org/Weblog/2006-04-04-01</link>
<description>&lt;p&gt;Gentoo's emerge command is powerful, but it stops if you attempt to run it in the background, and it often takes a long time.  Particularly when compiling stuff on a server, it's useful to have emerge work in the background even after I log out.&lt;/p&gt;
&lt;p&gt;Solution: the screen utility.  Before starting a long compile, I type &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;screen&lt;/span&gt;&lt;/tt&gt;, which opens a shell session that I can detach and reattach later.  I start emerge, wait a little to make sure I typed the command correctly, then push &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;Ctrl-A&lt;/span&gt;&lt;/tt&gt; followed by &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;D&lt;/span&gt;&lt;/tt&gt;.  At that point, I can log out while the screen session continues in the background.  A few hours later, I type &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;screen&lt;/span&gt; &lt;span class="pre"&gt;-x&lt;/span&gt;&lt;/tt&gt; to reattach the session, often using a different computer than the one I used to start the emerge.&lt;/p&gt;
&lt;p&gt;In general, the screen utility makes it easier to perform long-running automated tasks, like compiling software, downloading large files, or converting media files.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-04-04-01</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Linux</category>
<pubDate>Tue, 04 Apr 2006 02:26:09 -0600</pubDate>
</item>
<item>
<title>Elenco MWK-06 Addendum</title>
<link>http://hathawaymix.org/Weblog/2006-04-04</link>
<description>&lt;p&gt;I bought a train model kit for Esther at Christmas.  After we painted and assembled it, the motor ran fine, but the wheels stuck often.  The mechanism relies on the left and right sides being a quarter turn out of phase, but the nylon connectors often slip on the metal axles, causing the wheels to lock up.&lt;/p&gt;
&lt;p&gt;I found a nice solution to the problem.  Anyone who buys the MWK-06 kit and has problems with slippage probably ought to do this.  The strategy is to ensure the nylon &amp;quot;L&amp;quot; connectors never slip, and I did that by grinding the ends of the axles.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Use a small metal grinder (I used a rotary tool) to grind the ends of every axle into flattened surfaces, with the two ends of each axle oriented exactly out of phase with each other.  In other words, grind one end in a horizontal orientation and the other end in a vertical orientation.&lt;/li&gt;
&lt;li&gt;Squeeze the axle shafts on the nylon L connectors using pliers, flattening the round holes and turning them into ellipses.  Flatten all of them the same way.&lt;/li&gt;
&lt;li&gt;Press the flattened L connectors onto the axles.  The connectors should fit easily, but snugly; if they are loose, squeeze the L connectors some more.&lt;/li&gt;
&lt;li&gt;To allow free movement, tighten the screws connected to the pulleys and the second pair of wheels from the back as much as possible without causing friction.  Loosen the other 4 screws as much as possible without letting the nut fall off.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can supply pictures if these directions don't suffice.  The train moves pretty fast once it's running properly.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-04-04</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Life</category>
<pubDate>Tue, 04 Apr 2006 01:53:37 -0600</pubDate>
</item>
<item>
<title>PGStorage</title>
<link>http://hathawaymix.org/Weblog/2006-02-11</link>
<description>&lt;p&gt;I'm happy to announce &lt;a class="reference" href="http://hathawaymix.org/Software/PGStorage"&gt;PGStorage&lt;/a&gt;, a new ZODB backend that stores pickles in a PostgreSQL database.  It supports undo and packing.&lt;/p&gt;
&lt;p&gt;The intent of this storage is to be a good ZODB pickle store, and PostgreSQL has many features that simplify storage.  Taking advantage of all the capabilities of a relational database is not the intent of this storage.  The idea of a PostgreSQL storage has been attempted before, but at the time, PostgreSQL did not have the &lt;a class="reference" href="http://www.postgresql.org/docs/8.0/interactive/storage-toast.html"&gt;TOAST&lt;/a&gt; feature.  TOAST simplifies and optimizes the task of storing binary objects in the database.&lt;/p&gt;
&lt;p&gt;Multiple Zope instances can connect to the database, so ZEO is not needed when using this storage.  ZODB caches are invalidated by polling the database at transaction boundaries.  Is that an efficient choice?  Maybe.  If it's not, maybe the PostgreSQL &amp;quot;listen&amp;quot; and &amp;quot;notify&amp;quot; statements can help.&lt;/p&gt;
&lt;p&gt;I think PGStorage has a chance of beating ZEO in scalability, since PostgreSQL's networking layer is likely to be more efficient than ZEO.  PGStorage has a fraction of the number of lines of code of FileStorage and ZEO, but it's certainly not as well tested yet.&lt;/p&gt;
&lt;p&gt;I did end up fighting with ZODB somewhat.  ZODB normally assumes that MVCC is implemented by the Connection object, but this storage prefers to depend on the MVCC capabilities already in PostgreSQL, so I had to override some methods in Connection.  I didn't fight with ZODB in this package nearly as much as I did in Ape, though.&lt;/p&gt;
&lt;p&gt;Try it out and see if it's any good.  After the first round of testing, I plan to take advantage of the two-phase commit support in PostgreSQL 8.1.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-02-11</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Zope</category>
<pubDate>Sat, 11 Feb 2006 14:33:45 -0700</pubDate>
</item>
<item>
<title>Ape: Time's Up</title>
<link>http://hathawaymix.org/Weblog/2006-02-06</link>
<description>&lt;p&gt;Ape is an experiment in bridging transparent persistence to any medium.  I am primarily interested in storing on a filesystem, but the Zope community seems interested only in Ape's RDBMS persistence.  Well, here are some fundamental problems I've discovered that Ape has with RDBMS persistence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Ape can't add support for complex queries while retaining transparence.  Transparence would require all Ape gateways (such as filesystem support) to gain the ability to answer complex queries.&lt;/li&gt;
&lt;li&gt;Ape sends data to the database only at the end of a transaction, and in random order.  In high concurrency, this policy will lead to deadlocks.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Some more general problems with Ape:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Ape fights with ZODB to an extent.  With every ZODB release, I've had to change Ape to override different methods and use different attributes of the Connection class.&lt;/li&gt;
&lt;li&gt;The error messages produced by Ape (especially those involving UnmanagedJars) are hard to interpret.  They can only be solved with deep study.&lt;/li&gt;
&lt;li&gt;Projects typically become easier over time.  Ape isn't getting easier.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;So while I appreciate the work people have put into Ape, I no longer think it's the &lt;a class="reference" href="http://markos.gaivo.net/blog/?p=119"&gt;right idea&lt;/a&gt; for people to &lt;a class="reference" href="http://mail.python.org/pipermail/python-dev/2006-February/060415.html"&gt;spend their skills&lt;/a&gt; on.  However, I'm about to release a much simpler replacement.  Stay tuned.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2006-02-06</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Zope</category>
<pubDate>Mon, 06 Feb 2006 07:47:53 -0700</pubDate>
</item>
<item>
<title>Sally DeFord Music</title>
<link>http://hathawaymix.org/Weblog/2005-12-07</link>
<description>&lt;p&gt;My stake has a music library.  (A stake is a group of a few thousand members of the Church in a city or neighborhood.)  The library combines the music from all the wards in our stake.  (A ward is a group that meets together on Sunday.)  The library allows all the ward choir directors to draw from a common pool of music.&lt;/p&gt;
&lt;p&gt;The library seems like a fine way to increase the music selection for all the wards, but the problem is the library is locked up most of the time.  I can make an appointment with the librarian, but I can't help but feel like I'm interrupting her life during the appointment.  I'd like to get to know the music that's there, bring a keyboard to try out some of the songs, and browse the library for a few hours, but instead, I make a hasty decision and go home within 30 minutes.  Thus the stake music library is actually quite hard for me to use.&lt;/p&gt;
&lt;p&gt;How excited I was, then, to discover &lt;a class="reference" href="http://www.defordmusic.com/"&gt;Sally DeFord Music&lt;/a&gt; through a link from &lt;a class="reference" href="http://www.windley.com/archives/2005/11/free_sheet_musi.shtml"&gt;Phil Windley&lt;/a&gt;.  She composes songs appropriate for ward choirs and releases them for free.  In chorus with Phil, may I just say thanks to Sally, who has solved a big problem.  I haven't used the music yet, but I intend to use it right away.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2005-12-07</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Church</category>
<pubDate>Wed, 07 Dec 2005 10:29:00 -0700</pubDate>
</item>
<item>
<title>PostgreSQL and 'not x'</title>
<link>http://hathawaymix.org/Weblog/2005-10-31</link>
<description>&lt;p&gt;SQL syntax nuances make a big difference.  I have a table defined this way (details omitted for brevity):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
create table stripe (
  stripeid     int not null primary key,
  all_healthy  boolean not null default true,
  enough_online  boolean not null default true,
  in_batch   boolean not null default false
);
&lt;/pre&gt;
&lt;p&gt;I also have an index on this table:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
create index stripe_schedule_helper on stripe (all_healthy, enough_online, in_batch);
&lt;/pre&gt;
&lt;p&gt;The following query takes a long time because it performs a sequential scan on the whole table.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
select stripeid from stripe
where not all_healthy and enough_online and not in_batch
&lt;/pre&gt;
&lt;p&gt;The following query returns the same results but takes a tiny fraction of the time of the above query, because it uses the index:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
select stripeid from stripe
where all_healthy = false and enough_online and in_batch = false
&lt;/pre&gt;
&lt;p&gt;So, it turns out that the query optimizer perceives a major difference between the conditions &amp;quot;not x&amp;quot; and &amp;quot;x = false&amp;quot;, even when x is a non-null boolean.  I used 'explain analyze' to discover this.  It's rather surprising, but at least there's a workaround.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2005-10-31</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Free Software</category>
<pubDate>Mon, 31 Oct 2005 08:01:50 -0700</pubDate>
</item>
<item>
<title>Computing Storage Safety</title>
<link>http://hathawaymix.org/Weblog/2005-10-26</link>
<description>&lt;p&gt;I've written a small Python script that uses the equation I derived for computing digital storage safety.  You can download it from my &lt;a class="reference" href="http://hathawaymix.org/Software/Sketches"&gt;sketches page&lt;/a&gt;.  The equation is explained in the comments.  I believe the computation is correct, and it behaves just the way I expect.  Although I'm focusing mainly on hard drive storage, the computation can also apply to any digital storage involving multiple media, such as optical disks or tapes.&lt;/p&gt;
&lt;p&gt;Here is what the script outputs:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
media unit survival probability after 1 year with MTBF=1000000: 0.991273
media unit survival probability after 1 month with MTBF=1000000: 0.999268
survival probability after 1 month, based on observation (faked for now): 0.98
unprotected volume (can lose 0): 0.98
RAID 0 with 2 units (can lose 0): 0.9604
RAID 0 with 4 units (can lose 0): 0.92236816
RAID 1 with 2 units (can lose 1): 0.9996
RAID 1 with 3 units (can lose 2): 0.999992
RAID 5 with 8 units (can lose 1): 0.989663107901
RAID 5 with 4 units (can lose 1): 0.99766352
RAID 6 with 8 units (can lose 2): 0.999584542567
RAID 6 with 4 units (can lose 2): 0.99996848
FEC, 4 data, 3 protection (can lose 3): 0.999994664346
FEC, 4 data, 4 protection (can lose 4): 0.999999829607
FEC, 12 data, 3 protection (can lose 3): 0.999816994311
FEC, 20 data, 10 protection (can lose 10): 0.999999999992
FEC, 20 data, 20 protection (can lose 20): 1.0
&lt;/pre&gt;
&lt;p&gt;The first two lines are based on the MTBF numbers published by hard drive manufacturers.  If MTBF dominates the reliability equation (although a lot of people dispute that assumption--more on that in a moment), there is a 99.13% chance one hard drive will last one year.  One year is too long to wait between verification runs, though; a period of one month is more reasonable.  The chance of survival of one hard drive over one month is thus 99.93%.&lt;/p&gt;
&lt;p&gt;More about verification: periodically, the media must go through a verification process to detect lost replicas and chunks.  If all data is recovered and all lost replicas are replaced a short time after verification, all periods have a similar data loss risk.  Thus you can base the reliability estimate on the verification period rather than the expected media lifetime.  This is a life-saver for storage area network (SAN) vendors, since they perform very frequent verification, resulting in fairly high reliability.&lt;/p&gt;
&lt;p&gt;However, hard drive life is complicated by many factors other than MTBF.  It's better to measure hard drive reliability by buying 1000 drives and counting how many survive the period between verification runs.  That's expensive, though, so I made a wild guess that 98% of hard drives survive a period of one month between verification runs.  I expect that estimate to be pessimistic.  The third and fourth lines of the script output show my wild guess.  (Does anyone have a real estimate?  I haven't found anything.)&lt;/p&gt;
&lt;p&gt;The rest of the output is based on the estimated reliability of one media unit, but even if the estimate is wrong, the script is still useful for comparing the reliability of different storage configurations.  RAID 0, which has no redundancy, causes reliability to drop.  RAID 1 (mirroring) raises reliability at the expense of storage size.  RAID 5 is less reliable than RAID 1, as expected.&lt;/p&gt;
&lt;p&gt;There's a new module in the Linux kernel that implements RAID 6.  I'm not sure the meaning of RAID 6 is standardized in the storage industry, but the module says you can lose any two drives in a RAID 6 configuration.  If it's for real, RAID 6 is an excellent middle-ground between RAID 1 and RAID 5.&lt;/p&gt;
&lt;p&gt;Now, we're talking about a lot of 9's, but the discussion so far has been about only one stripe.  (I'm using the term stripe to refer to a set of protected media units, including both data and protection bytes.)  If there are two stripes, the total reliability is the reliability of one stripe multiplied by the reliability of the other stripe.  To compute the reliability of a petabyte built on 1 TB stripes, take the reliability of one stripe and raise it to the 1000th power.&lt;/p&gt;
&lt;p&gt;Let's say I have 1000 boxes, each holding four 320 GB drives in a RAID 5 configuration.  That's 960 TB of usable (though unmanaged) space.  The script I wrote computes that each box has a reliability of 0.99766352.  (0.99766352 ** 1000) == 0.0964.  Yikes!  If the figures are right, the chance of retaining all of the data from month to month is less than 10%.  Frequent verification helps a lot, but it still doesn't make me feel comfortable enough to build a multi-petabyte storage system, since a 320 GB drive is so large that it takes hours just to scan the surface, and that's without seeking.  The verification time can only be so short.&lt;/p&gt;
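&lt;p&gt;The figures above are consistent with a simple binomial model: a stripe survives if no more than the allowed number of units fail.  Here is a minimal sketch of that arithmetic, assuming every unit survives the period independently with the same probability.  The function name is made up for illustration, and the downloadable script may be organized differently.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
from math import comb


def stripe_survival(p, units, can_lose):
    """Probability that at most `can_lose` of `units` independent units fail.

    Assumes each unit survives the period with the same probability p.
    """
    q = 1.0 - p
    return sum(
        comb(units, failed) * q ** failed * p ** (units - failed)
        for failed in range(can_lose + 1)
    )


raid5_box = stripe_survival(0.98, 4, 1)  # four drives, can lose one
print(raid5_box)                         # about 0.99766352
print(raid5_box ** 1000)                 # about 0.0964 for 1000 boxes
```
&lt;/pre&gt;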
&lt;p&gt;That's why all of those 9's matter.  When combining thousands of drives, to overcome the effect of the large exponent, the reliability of a single stripe has to be extremely high.  Fortunately, forward error correction like the strategy implemented in &lt;a class="reference" href="http://www.csua.berkeley.edu/~emin/source_code/dibs/"&gt;DIBS&lt;/a&gt; and the &lt;a class="reference" href="http://hathawaymix.org/Software/ReedSolomon"&gt;Reed-Solomon Python extension&lt;/a&gt; makes it possible to build such a high level of reliability that the floating point calculation rounds the estimate to 1.0.  Also note that the 20:10 FEC configuration lets you store half a protection byte for every data byte, yet the reliability ends up much higher than mirroring.&lt;/p&gt;
&lt;p&gt;That's the math.  Of course, even the best storage plans can be thwarted by fire, theft, natural disasters, or an errant system administrator.&lt;/p&gt;
</description>
<guid>http://hathawaymix.org/Weblog/2005-10-26</guid>
<author>shane.remove-this.if-you-are-not.a-spammer@hathawaymix.org</author>
<category>Python</category>
<category>Software</category>
<pubDate>Wed, 26 Oct 2005 11:51:29 -0600</pubDate>
</item>

</channel>
</rss>
