/ Hathaway Weblog / Python XML and the CPU cache

Shane :: Python :: January 27, 2005 # Python XML and the CPU cache

The Python XML performance benchmarks just shot through the roof. Congratulations, Fredrik!

The Daily Python-URL pointed out an article written by Uche Ogbuji, who is also contending for a title in the speedy XML parser arena. I agree with Uche that XML parsers must support unicode or they can't compete. However, Uche's benchmark used a large XML document--the entire Old Testament! In doing so, he blew the CPU cache, and I think that's why the numbers came out so different.

I learned a hard lesson in CPU caches a few weeks ago. I had written a small, optimized routine in Java for flipping and rotating raw 8-bit grayscale images. The routine seemed plenty fast, but profiling revealed that this routine was a bottleneck, taking about 1 second for each 20 MB image. So I converted the routine to C and wrote my first JNI module. To my horror, the C code turned out to be twice as slow! Researching further, I realized the way I was using JNI caused the image data (stored in a byte array) to be copied during the transformation. After switching to lower-level JNI functions, the C code ran at the same speed as the Java code.

Since the simple routine was a tight loop that compiled to only 20 machine instructions or so, I thought the 2.4 GHz processor should have finished the work in milliseconds. It turns out I had failed to account for the size of the CPU cache. Because the images could not fit in the CPU cache, my routine was bound by memory bandwidth. To solve the problem, I moved the routine to other compute nodes that could perform the work in parallel. I had to convert the JNI module to a Python extension, but the translation was surprisingly easy.

So I'm guessing Fredrik's code is now so fast that, for large documents, it's bound by memory bandwidth rather than CPU. I think Uche unwittingly measured some multiple of his memory bandwidth rather than the raw speed of the XML processor. A fair benchmark with a large document would require a 4+ MB L1 cache. Most L1 caches are in the 128 - 1024 KB range.

Comments

Martijn Faassen (January 28, 2005 05:20)

I somewhat doubt the CPU cache had anything to do with it. With cElementTree on my computer I can do Uche's benchmark in about 0.23 seconds (that's parsing and looking for the verses with 'begat'). My computer isn't that much faster than Uche's. Besides that, Fredrik has been using the same file for his own parsing benchmarks the entire time.

There are a number of things odd with Uche's benchmark. Uche doesn't seem to want to talk about what he did exactly, so we'll never know. See my comments and this article in my blog for more.

Shane Hathaway (January 29, 2005 00:44)

Hey cool, thanks for pointing out your weblog. Do you publish an RSS feed? I see there's RDF embedded, but my reader (the Firefox Sage extension) doesn't see it.

You're right, there's just not enough information to know. I still blame the CPU cache, but having just come to understand how it affects software developers, I admit I'm trying to find problems that relate to it. :-) It's possible that you have a large enough cache to keep the important parts of the Python interpreter, cElementTree, and whatnot in the cache through the entire document, while Uche does not.

Fredrik (January 29, 2005 07:11)

Note that the 200 microsecond benchmark you linked to was intentionally flawed (very few people have spotted the trick, though). For an explanation, see:

http://online.effbot.org/2005_01_01_archive.htm#20050127

And for the record, the correct time is about 4 milliseconds. (20 times slower, but still over 1000 times faster than the competing library Aaron mentions in this article ;-)

A cache analysis of cET would be quite interesting. It's likely that cET has a much smaller working set than many other toolkits, which might explain the "superscalar" results I've been seeing on some machines. But I don't have time to dig into that right now. So much code, so little time...

Martijn Faassen (January 31, 2005 05:57)

I still highly doubt the cache can explain much of the performance differences, as libxml2 can do a parse in about the same time as cElementTree (a bit slower, usually, but it doesn't matter a lot), and it creates a structure a lot bigger than cElementTree does. /proc/cpuinfo claims that my Athlon XP 2800+ has a cache of 512 kilobytes. Note that I've tried the software also on the exact same laptop model (Dell Inspiron 8600, centrino 1700 mhz) as Uche did, and I don't get vast differences in performance characteristics on it.

The XML icon on my weblog page has a link to the RSS feed (2.0, but I think newsbruiser adapts):

http://faassen.n--tree.net/blog/syndicate/weblog

No further comments may be added.

2 Nephi 28:7-9 (Click below to fill in the blanks.)
Your browser is not able to display the scripture fill-in program. To see it, enable Javascript or use Mozilla 1.0 or better.

Church: lds scriptures provident games pearls kzion shiblon film chancellor gateway cumorah byutv happiness nephi
Zope: freezope org com zen labs newbies zettai warnes
Python: home pyzine daily icanprogram
Genealogy: cyndi
Weblogs: jeffrey paul jon joel another-shane guido barry jeremy windley chrism zac
News: quakes lwn dc weather deseret zeitgeist softwarelivre
Zaurus: software developer
Tech: tango spintronics thin
Semantic: aaron sean
Reference: css rdf html4 javascript geckodom iecss emacs phrases acronyms
Reverse: advogato slashdot
Misc: gimp-savvy directory soda jokes shouldexist pdphoto