One thing I've never seen satisfactorily explained for distributed fault-tolerant web apps is how to distribute configuration information. For example, if you have a cluster of web servers and a pair of MySQL database servers in dual-master-single-writer configuration (both configured to replicate from each other for easy failover, but with only one handling writes), how do the web servers figure out who the master should be? Especially, when a master server goes down, then comes back online (possibly in an inconsistent state), how does it know whether it should become active again or not?
One failure model I have in mind is where you have [clusters of] servers in different locations, such that you need to change DNS to fail over from one location to another. In this case, some clients may still have the old IP address for your website in their cache, so if a colo loses its network connection and the site fails over to a different colo, but then the initial colo comes back up, it needs to know ASAP whether it's meant to be serving traffic or redirecting to the other location, or there's a danger that it will accept updates from clients, which will be hard to reconcile later. (Assuming you're not using an 'eventual consistency' model, in which case the system will probably sort things out by itself).
Also, say each cluster hosts DNS. A cluster that has recently lost its network connection shouldn't start serving DNS immediately upon going back online, in case a failover operation has resulted in records changing.
What I'd like [to build?] is a service that distributes configuration information between hosts, in such a way that the hosts can determine whether they are safe to go online or not, and which hosts are currently authoritative. So when a web host starts up, it can conclusively figure out where its database servers are and whether it should be redirecting (or answering at all), and when a master database host starts up, it can conclusively figure out whether it should accept updates from clients or whether it should rebuild itself from a new master.
The best lead I have so far is the Paxos algorithm, and Google's Chubby filesystem/lock manager. I think the trick will be for each host to maintain a local database of settings (host locations, active colos etc), versioned with the Paxos proposal number, and to continuously poll other hosts. When a host loses contact with enough other hosts to not have a quorum for Paxos, it should deactivate itself.
I haven't thought this through well enough yet, though. Another interesting thing to try would be for all hosts to determine network connectivity, and to vote on what other hosts/colos are down, to determine what to do with DNS and database master selection...
Update: I've found a project that may do this: see my next post, on ZooKeeper.