Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).


Distributing initial configuration information

One thing I've never seen satisfactorily explained for distributed fault-tolerant web apps is how to distribute configuration information. For example, if you have a cluster of web servers and a pair of MySQL database servers in dual-master-single-writer configuration (both configured to replicate from each other for easy failover, but with only one handling writes), how do the web servers figure out who the master should be? Especially, when a master server goes down, then comes back online (possibly in an inconsistent state), how does it know whether it should become active again or not?

One failure model I have in mind is where you have [clusters of] servers in different locations, such that you need to change DNS to fail over from one location to another. In this case, some clients may still have the old IP address for your website in their cache, so if a colo loses its network connection and the site fails over to a different colo, but then the initial colo comes back up, it needs to know ASAP whether it's meant to be serving traffic or redirecting to the other location, or there's a danger that it will accept updates from clients, which will be hard to reconcile later. (Assuming you're not using an 'eventual consistency' model, in which case the system will probably sort things out by itself).

Also, say each cluster hosts DNS. A cluster that has recently lost its network connection shouldn't start serving DNS immediately upon going back online, in case a failover operation has resulted in records changing.

What I'd like [to build?] is a service that distributes configuration information between hosts, in such a way that the hosts can determine whether they are safe to go online or not, and which hosts are currently authoritative. So when a web host starts up, it can conclusively figure out where its database servers are and whether it should be redirecting (or answering at all), and when a master database host starts up, it can conclusively figure out whether it should accept updates from clients or whether it should rebuild itself from a new master.

The best lead I have so far is the Paxos algorithm, and Google's Chubby filesystem/lock manager. I think the trick will be for each host to maintain a local database of settings (host locations, active colos etc), versioned with the Paxos proposal number, and to continuously poll other hosts. When a host loses contact with enough other hosts to not have a quorum for Paxos, it should deactivate itself.

I haven't thought this through well enough yet, though. Another interesting thing to try would be for all hosts to determine network connectivity, and to vote on what other hosts/colos are down, to determine what to do with DNS and database master selection...

Update: I've found a project that may do this: see my next post, on ZooKeeper.

Hadoop up and running... now what to do with it?

I've had a pile of unused but relatively powerful (one Athlon 1333, one Athlon XP 2100+ and two Athlon XP 2800+) computers sitting around not doing anything for the last little while. I've been meaning to hook them all up together, along with the various laptops in the house, into some sort of cluster to play around with for a while. Finally got around to installing Hadoop on a couple of yesterday, so now I have a little mini-cluster here.

Hadoop is interesting; my initial expectation was that it would be a general distributed task/job handling system that happens to handle Map/Reduce type workflows, but from the looks of things it's more the other way around - it's completely built to run Map/Reduce jobs (in particular ones which handle large numbers of records from log files or databases) and can possibly be hacked to handle general distributed jobs.

The one distributed job I want to try to run on it doesn't quite fit the Map/Reduce model, but that's probably just because of how it's structured right now. It's an "embarrassingly parallel" type problem with lots of brute-force testing (machine learning stuff) that boils down to something like average { map { evaluate(best_of(map { test_candidate } find_candidates)) } (split $input) }.

I guess the way to do it is to run it as three consecutive Map/Reduce jobs.

1. Only Map: $sets_of_candidates = map { find_candidates } (split $input)

2. Map/Reduce: $best_candidates = best_of { map { test } ($sets_of_candidates) }

3. Map/Reduce: $evaluation = average { map { evaluate } ($best_candidates) }

An interesting observation is that while the model is named Map/Reduce, the initial step, splitting the data up, is really important too. I get the impression that much of the work that's gone into Hadoop has been in the area of intelligently dividing data between workers.