Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2005-8-13

More on "reverse" search

Bob Wyman commented on the previous post and implied that PubSub doesn't use a tree-based method. Interesting!

30K messages per second would give you 80k CPU cycles per message, which sounds pretty reasonable.

If I were implementing a tree-based system, I'd do it the way Phillip Eby suggests - make the tree add-only, and either garbage collect or just rebuild the whole tree once in a while. If you think of the tree as being built out of nested hashtables, you could build it one subscription at a time by compiling the subscription into a standalone tree and effectively ORing it on top of the existing tree: where a branch doesn't exist, create it, otherwise follow the existing branch. This should be approximately linear with number of subscriptions (it will get a little slower as the tree grows as your hashtables won't be as fast, but hashtable lookups - which could be binary searches here, because we're just dealing with numbers - are O(log N), which isn't so bad).

My "bogged down" comment was referring more to the case when you only properly index the first "required word" of each subscription, instead of properly merging stuff into a tree. I believe this sort of algorithm could handle complex structured data, but I see what you mean - things get harder there. Matching simple attributes, like a PubSub search for "TITLE:foo", isn't so bad - you start off by dividing the post into its component attributes, and then run the search on each of them in turn. Alternatively you could build it into the tree - a search for "TITLE:foo" would start off matching the word "foo" and then after that, there would be a test to see if it was in the title. (Otherwise you would need to evaluate every word in the post twice - once with the attribute attached and once without, so a title containing "foo" would match searches for both "TITLE:foo" and "foo").

Something I was thinking of that didn't make it into the post was that "evaluating the whole expression tree" for each word doesn't have to mean doing it the slow way for complex queries. It could be done in a proper recursive manner - so the procedure for moving down the tree would be the same whether you are at the root or deeper inside. If the current node has 10K children, you do the query "backwards" - walk through the words in the post being matched and find matching child nodes that way - whereas if it only has 5, you evaluate it in the traditional way.

... more like this: [, ]