Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2003-5-8

Writing a custom Query object in Lucene

(topic: search engine)

OK, here we go. Exploring the internals of Lucene -- this post is going to get pretty technical!

I'm implementing authentication using a custom Query object (called PermissibleUrlQuery) that I can fit in alongside the usual query with a BooleanQuery, like this:

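// Parse the user's query against the "description" field.
Query user_query = null;
try {
    user_query = QueryParser.parse(q, "description", analyzer);
} catch (ParseException e) {
    out.println("Error parsing query: " + e.getMessage() + "<br>");
}
Hits hits = null;
if (user_query != null) {  // skip the search if parsing failed
    PermissibleUrlQuery perm_query = new PermissibleUrlQuery();
    BooleanQuery query = new BooleanQuery();
    // add(query, required, prohibited): both clauses must match.
    query.add(perm_query, true, false);
    query.add(user_query, true, false);
    hits = searcher.search(query);
}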


It looks like the actual scoring is done by a Scorer object, which Lucene gets from your Query by calling its scorer method. The scorer's score method then gets called to score a group of documents (it seems that Lucene likes to do things in groups). It is passed a HitCollector, whose collect method it must call for every non-zero-scoring document, and a value, maxDoc, which I guess tells you where to stop scoring.

So, where do we start scoring? We can get a TermDocs object (which lets us see all documents containing a given term) out of the IndexReader by calling reader.termDocs(term). TermScorer uses one of these - it grabs a bunch of documents out of the TermDocs object, then iterates through them (scoring as it goes), stopping when it hits a document ID higher than maxDoc.
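Here is a rough sketch of that scoring loop, as I understand it. It does not use Lucene's classes: Collector, PostingList and BatchScorer are hypothetical in-memory stand-ins for HitCollector, TermDocs and TermScorer. The square-root scoring is a toy formula, and treating maxDoc as an exclusive upper bound is an assumption on my part.

```java
// Hypothetical stand-in for Lucene's HitCollector: receives each matching document.
interface Collector {
    void collect(int doc, float score);
}

// Hypothetical in-memory stand-in for TermDocs: sorted document IDs plus term frequencies.
final class PostingList {
    private final int[] docs;
    private final int[] freqs;
    private int pos = 0;

    PostingList(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    // Copies up to outDocs.length postings into the buffers; returns how many were copied.
    int read(int[] outDocs, int[] outFreqs) {
        int n = Math.min(outDocs.length, docs.length - pos);
        System.arraycopy(docs, pos, outDocs, 0, n);
        System.arraycopy(freqs, pos, outFreqs, 0, n);
        pos += n;
        return n;
    }
}

final class BatchScorer {
    private static final int BATCH = 4; // deliberately tiny so the refill path gets exercised

    // Scores every posting whose document ID is below maxDoc, one batch at a time.
    static void score(PostingList postings, Collector collector, int maxDoc) {
        int[] docs = new int[BATCH];
        int[] freqs = new int[BATCH];
        int n;
        while ((n = postings.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                if (docs[i] >= maxDoc) {
                    return; // past the requested range (assumed exclusive): stop
                }
                // Toy score: square root of the term frequency, with no norms or weights.
                collector.collect(docs[i], (float) Math.sqrt(freqs[i]));
            }
        }
    }
}
```

Unlike a real scorer, this sketch does not remember its position between calls, so it only illustrates a single pass.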

I guess I can do pretty much the same thing: get a TermDocs object containing all the documents with stored URLs (which should be everything), then iterate over them, dropping the ones the current user isn't allowed to see on the floor and scoring "visible" ones with some constant value.
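A minimal sketch of that plan, not the real PermissibleUrlQuery code. The index is faked with an array of stored URLs indexed by document ID, and the names UrlHitCollector, PermissibleUrlScorerSketch and allowedUrls are all made up for illustration, as is the constant score of 1.0.

```java
import java.util.Set;

// Hypothetical callback standing in for Lucene's HitCollector.
interface UrlHitCollector {
    void collect(int doc, float score);
}

final class PermissibleUrlScorerSketch {
    private static final float VISIBLE_SCORE = 1.0f; // arbitrary constant for visible documents

    private final String[] urlByDoc;       // stored URL for each document ID (stand-in for the index)
    private final Set<String> allowedUrls; // URLs the current user may see (hypothetical)

    PermissibleUrlScorerSketch(String[] urlByDoc, Set<String> allowedUrls) {
        this.urlByDoc = urlByDoc;
        this.allowedUrls = allowedUrls;
    }

    // Walks document IDs below maxDoc and reports only the visible ones.
    void score(UrlHitCollector collector, int maxDoc) {
        int end = Math.min(maxDoc, urlByDoc.length);
        for (int doc = 0; doc < end; doc++) {
            if (allowedUrls.contains(urlByDoc[doc])) {
                collector.collect(doc, VISIBLE_SCORE);
            }
            // Anything else is dropped: collect is never called for it.
        }
    }
}
```

A real version would presumably look the permissions up per user rather than take a fixed set, but the shape of the loop should be about the same.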

Aha - I've tried this out and it doesn't quite work, because it looks like each Query object is expected to independently go through the index and find documents it likes, and BooleanQuery does some set magic and merges all the results together. This isn't what I want -- I just want to filter the results that another Query gives me. I think the right way to do this is to replace BooleanQuery; get it to make a wrapper Scorer object that filters out URLs we don't want the user to see. Will try this next time!
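The wrapper idea might look something like the following. This is only a sketch of the shape, assuming a simplified scorer interface: HitSink, DocScorer and FilteringScorer are hypothetical names, not Lucene classes, and the visibility check is reduced to a predicate on the document ID.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-in for Lucene's HitCollector.
interface HitSink {
    void collect(int doc, float score);
}

// Hypothetical, simplified version of the Scorer contract described above.
interface DocScorer {
    void score(HitSink sink, int maxDoc);
}

// Wraps another scorer and only passes on the hits that the user is allowed to see.
final class FilteringScorer implements DocScorer {
    private final DocScorer inner;
    private final IntPredicate visible; // true if the user may see this document ID

    FilteringScorer(DocScorer inner, IntPredicate visible) {
        this.inner = inner;
        this.visible = visible;
    }

    @Override
    public void score(HitSink sink, int maxDoc) {
        // Hand the inner scorer a collector that drops hidden documents
        // and forwards the rest with their original scores.
        inner.score((doc, score) -> {
            if (visible.test(doc)) {
                sink.collect(doc, score);
            }
        }, maxDoc);
    }
}
```

The point of this shape is that the wrapped query still decides which documents match and how they score; the wrapper only removes hits and never has to walk the index itself.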

CVS and SourceSafe

I've only used two source control systems, and I hate them already. Each one has its own little bit of brain-deadedness that makes it periodically totally unusable.

CVS is quite usable if you host the repository on a Linux server, trust your developers enough to give them shell accounts, and use SSH for the transport, with keys rather than passwords for authentication. Try running a server on Windows using the ntserver transport, however, and you will feel my pain. Then, once you think you've got it working (probably by going back to pserver and running without passwords), get a developer to check in something under Cygwin, and see if your line endings are all screwed up. Argh!

That said, if you use CVS for NT with OpenSSH for Windows and run the server on Linux, you'll be fine. I regularly use several SourceForge repositories on both Windows and Linux without hassle, and have a remote repository to hold things like the search engine project I've been talking about recently, and nothing's wrong there.

SourceSafe is another matter. We use it at work with Visual Studio, and it integrates very nicely with the IDE. You get a 'pending checkins' window that shows you which files you've got exclusive locks on, and it asks you if you want to check something out when you start editing it. Very handy. However, it seems to periodically almost hang the machine it's running on. Get Latest Version (the equivalent of cvs update) sometimes takes all day, and half the time when I try to check something out (get an exclusive lock) it just fails for an unspecified reason. Scary.

That said, it understands renaming files (unlike CVS) and has a pretty graphical diff utility. Unfortunately neither feature makes the general feel of instability any less scary ...