|
|
/ Hathaway Weblog / Python XML and the CPU cache |
The Python XML performance benchmarks just shot through the roof. Congratulations, Fredrik!
The Daily Python-URL pointed out an article written by Uche Ogbuji, who is also contending for a title in the speedy XML parser arena. I agree with Uche that XML parsers must support unicode or they can't compete. However, Uche's benchmark used a large XML document--the entire Old Testament! In doing so, he blew the CPU cache, and I think that's why the numbers came out so different.
I learned a hard lesson in CPU caches a few weeks ago. I had written a small, optimized routine in Java for flipping and rotating raw 8-bit grayscale images. The routine seemed plenty fast, but profiling revealed that this routine was a bottleneck, taking about 1 second for each 20 MB image. So I converted the routine to C and wrote my first JNI module. To my horror, the C code turned out to be twice as slow! Researching further, I realized the way I was using JNI caused the image data (stored in a byte array) to be copied during the transformation. After switching to lower-level JNI functions, the C code ran at the same speed as the Java code.
Since the simple routine was a tight loop that compiled to only 20 machine instructions or so, I thought the 2.4 GHz processor should have finished the work in milliseconds. It turns out I had failed to account for the size of the CPU cache. Because the images could not fit in the CPU cache, my routine was bound by memory bandwidth. To solve the problem, I moved the routine to other compute nodes that could perform the work in parallel. I had to convert the JNI module to a Python extension, but the translation was surprisingly easy.
So I'm guessing Fredrik's code is now so fast that, for large documents, it's bound by memory bandwidth rather than CPU. I think Uche unwittingly measured some multiple of his memory bandwidth rather than the raw speed of the XML processor. A fair benchmark with a large document would require a 4+ MB L1 cache. Most L1 caches are in the 128 - 1024 KB range.
Comments
I somewhat doubt the CPU cache had anything to do with it. With cElementTree on my computer I can do Uche's benchmark in about 0.23 seconds (that's parsing and looking for the verses with 'begat'). My computer isn't that much faster than Uche's. Besides that, Fredrik has been using the same file for his own parsing benchmarks the entire time.
There are a number of things odd with Uche's benchmark. Uche doesn't seem to want to talk about what he did exactly, so we'll never know. See my comments and this article in my blog for more.
Hey cool, thanks for pointing out your weblog. Do you publish an RSS feed? I see there's RDF embedded, but my reader (the Firefox Sage extension) doesn't see it.
You're right, there's just not enough information to know. I still blame the CPU cache, but having just come to understand how it affects software developers, I admit I'm trying to find problems that relate to it. :-) It's possible that you have a large enough cache to keep the important parts of the Python interpreter, cElementTree, and whatnot in the cache through the entire document, while Uche does not.
Note that the 200 microsecond benchmark you linked to was intentionally flawed (very few people have spotted the trick, though). For an explanation, see:
http://online.effbot.org/2005_01_01_archive.htm#20050127
And for the record, the correct time is about 4 milliseconds. (20 times slower, but still over 1000 times faster than the competing library Aaron mentions in this article ;-)
A cache analysis of cET would be quite interesting. It's likely that cET has a much smaller working set than many other toolkits, which might explain the "superscalar" results I've been seeing on some machines. But I don't have time to dig into that right now. So much code, so little time...
I still highly doubt the cache can explain much of the performance differences, as libxml2 can do a parse in about the same time as cElementTree (a bit slower, usually, but it doesn't matter a lot), and it creates a structure a lot bigger than cElementTree does. /proc/cpuinfo claims that my Athlon XP 2800+ has a cache of 512 kilobytes. Note that I've tried the software also on the exact same laptop model (Dell Inspiron 8600, centrino 1700 mhz) as Uche did, and I don't get vast differences in performance characteristics on it.
The XML icon on my weblog page has a link to the RSS feed (2.0, but I think newsbruiser adapts):
