Hathaway Weblog / Messaging
I wonder if Jeffrey and I are moving in the same direction with respect to technology.
I've been building a system with 6 types of nodes talking to each other in a cluster. There is one central controller, but potentially hundreds of the other 5 node types. The central controller runs ZODB and Zope 3, but I didn't like the idea of nodes talking to the central controller over HTTP as is customary in Zope. Every node needs to be able to send a message at any time and to detect the disappearance of its peer.
So I wrote a simple asynchronous messaging protocol based on a persistent TCP connection. This strategy turned out to be extremely effective for fault tolerance. You can kill any node and others automatically take over. Nodes never care whether their messages actually arrive at the other end; they simply wait for the next message. A broken connection is treated as just another message, and it usually results in work being suspended or moved to another node.
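To make the "broken connection is just another message" idea concrete, here is a minimal sketch, not the actual implementation. It assumes each peer has a reader loop that feeds a shared inbox; `pump_messages`, `read_message`, and the `connection_lost` message name are hypothetical and chosen only for illustration.

```python
import queue

# Hypothetical name for the synthetic message; the post does not name it.
CONNECTION_LOST = "connection_lost"


def pump_messages(read_message, inbox, peer_id):
    """Forward messages from one peer into `inbox` until the connection breaks.

    `read_message` is any callable that returns a (method, args) pair and
    raises ConnectionError or EOFError when the peer goes away.
    """
    while True:
        try:
            method, args = read_message()
        except (ConnectionError, EOFError):
            # The broken connection is queued like any other message, so the
            # consumer needs no separate error path.
            inbox.put((peer_id, CONNECTION_LOST, ()))
            return
        inbox.put((peer_id, method, args))
```

Whatever consumes the inbox can then react to `connection_lost` the same way it reacts to any other event, for example by suspending the work or handing it to another node.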
The central controller runs one state machine for each work item and one for each worker node. It uses transactions, but only to protect itself from internal bugs that throw an exception and would otherwise leave the state machines in an inconsistent state.
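The role the transaction plays here can be sketched roughly as follows. This is an illustration under stated assumptions, not the controller's code: the states, events, and function names are hypothetical, and a deep-copied snapshot stands in for a real ZODB transaction.

```python
import copy
from contextlib import contextmanager

# Hypothetical states and events for a work item; the post does not list them.
TRANSITIONS = {
    ("queued", "assigned"): "running",
    ("running", "finished"): "done",
    ("running", "connection_lost"): "queued",
}


@contextmanager
def rollback_on_error(machines):
    """Restore every state machine if the block raises.

    A snapshot is used here as a stand-in for aborting a ZODB transaction.
    """
    snapshot = copy.deepcopy(machines)
    try:
        yield machines
    except Exception:
        machines.clear()
        machines.update(snapshot)
        raise


def handle_event(machines, item_id, event):
    """Advance one work item's state; an illegal transition raises KeyError."""
    state = machines[item_id]
    machines[item_id] = TRANSITIONS[(state, event)]
```

If a bug raises partway through handling a message, every machine returns to the state it had before that message, so a half-applied update never survives.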
The messaging protocol is dead simple. For each message, it sends the length of the message followed by the message itself, which can be encoded as a Python pickle or an XML-RPC method call. (XML-RPC support is for interoperability with Java nodes.) On the other end, the message is interpreted as an asynchronous method call whose return value is ignored. I like this protocol, but for future systems, I want to find something almost as simple that someone else is maintaining.
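A length-prefixed protocol of this kind might look roughly like the sketch below. It covers only the pickle encoding. The 4-byte big-endian length header is an assumption (the post does not say how the length is encoded), and all function names are hypothetical.

```python
import pickle
import socket
import struct

# Assumption: a 4-byte big-endian unsigned length prefix.
HEADER = struct.Struct("!I")


def send_message(sock, method, *args):
    """Send one method call as <length><pickled (method, args)>."""
    payload = pickle.dumps((method, args))
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exactly(sock, n):
    """Read exactly n bytes, or raise ConnectionError if the peer closes."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one message and return it as a (method, args) pair."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return pickle.loads(recv_exactly(sock, length))


def dispatch(handler, message):
    """Treat the message as a method call; the return value is ignored."""
    method, args = message
    getattr(handler, method)(*args)
```

Unpickling executes whatever the sender encoded, so a scheme like this is only reasonable between nodes that trust each other.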
Two contenders to replace my protocol are Spread and JMS. Both do just what I'm looking for and probably scale better than my simple protocol. A drawback of Spread is that it runs on UDP, which can be a pain to get through firewalls. A drawback of JMS is that I have to go through a Java bridge. Interestingly, as manageability.org points out, I may not have to make a choice.
