Writing a custom Query object in Lucene
(topic: search engine)
OK, here we go. Exploring the internals of Lucene -- this post is going to get pretty technical!
I'm implementing authentication using a custom Query object (called PermissibleUrlQuery) that I can fit in alongside the usual query with a BooleanQuery, like this:
    Query user_query = null;
    try {
        user_query = QueryParser.parse(q, "description", analyzer);
    } catch (ParseException e) {
        // user_query stays null if the user's query can't be parsed
        out.println("Error parsing query: " + e.getMessage() + "<br>");
    }
    PermissibleUrlQuery perm_query = new PermissibleUrlQuery();
    BooleanQuery query = new BooleanQuery();
    // add(query, required, prohibited): both clauses are required
    query.add(perm_query, true, false);
    query.add(user_query, true, false);
    Hits hits = searcher.search(query);
It looks like the actual scoring is done by a Scorer object. Lucene gets this from your Query by calling scorer. The scorer's score method gets called to score a group of documents (it seems that Lucene likes to do things in groups). It gets a HitCollector (whose collect method it must call for every non-zero-scoring document) and a value, maxDoc, which I guess tells you where to stop scoring.
So, where do we start scoring? We can get a TermDocs object (which lets us see all documents containing a given term) out of the IndexReader by calling reader.termDocs(term). TermScorer uses one of these - it grabs a bunch of documents out of the TermDocs object, then iterates through them (scoring as it goes), stopping when it hits a document ID higher than maxDoc.
I guess I can do pretty much the same thing: get a TermDocs object containing all the documents with stored URLs (which should be everything), then iterate over them, dropping the ones the current user isn't allowed to see on the floor and scoring "visible" ones with some constant value.
Aha - I've tried this out and it doesn't quite work, because it looks like each Query object is expected to independently go through the index and find documents it likes, and BooleanQuery does some set magic and merges all the results together. This isn't what I want -- I just want to filter the results that another Query gives me. I think the right way to do this is to replace BooleanQuery; get it to make a wrapper Scorer object that filters out URLs we don't want the user to see. Will try this next time!
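To make the "filter what another query gives me" idea concrete, here is a minimal sketch in plain modern Java. It does not use Lucene at all: Collector, filtering, runQuery and visibleDocs are hypothetical names invented for illustration, and the urlByDoc array stands in for the stored URL of each document. It also filters at the collector level rather than inside a Scorer, so treat it as the shape of the wrapper, not as how Lucene would want it done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins, NOT Lucene classes: this only models the idea of
// wrapping one hit consumer in another that filters by URL.
class PermissionFilterSketch {

    /** Plays the role of a hit collector: receives each matching doc and its score. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Wraps a collector and forwards only hits whose URL is in allowedUrls. */
    static Collector filtering(Collector inner, String[] urlByDoc, Set<String> allowedUrls) {
        return (doc, score) -> {
            if (allowedUrls.contains(urlByDoc[doc])) {
                inner.collect(doc, score);
            }
            // Hits for URLs the user may not see are silently dropped.
        };
    }

    /** Simulates the user's query reporting (doc, score) hits in doc-ID order. */
    static void runQuery(int[] docs, float[] scores, Collector out) {
        for (int i = 0; i < docs.length; i++) {
            out.collect(docs[i], scores[i]);
        }
    }

    /** Returns the doc IDs that survive the permission filter. */
    static List<Integer> visibleDocs(int[] docs, float[] scores,
                                     String[] urlByDoc, Set<String> allowedUrls) {
        List<Integer> kept = new ArrayList<>();
        Collector sink = (doc, score) -> kept.add(doc);
        runQuery(docs, scores, filtering(sink, urlByDoc, allowedUrls));
        return kept;
    }

    public static void main(String[] args) {
        String[] urlByDoc = {"/public/a", "/private/b", "/public/c"};
        Set<String> allowed = Set.of("/public/a", "/public/c");
        int[] docs = {0, 1, 2};
        float[] scores = {0.9f, 0.8f, 0.4f};
        System.out.println(visibleDocs(docs, scores, urlByDoc, allowed)); // prints [0, 2]
    }
}
```

Document 1 is dropped because its URL is not in the allowed set, while the scores of the surviving hits pass through untouched. The permission check never has to walk the index itself; it only sees what the user's query already matched.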