Building a search engine in Python

[see also: posts about Blog Search]

Today I'm thinking about how to make a search engine in Python. This is for the Python Community Server project. I've been asked to make a private PyCS instance searchable. It's easy to make a public community searchable - you just install mnoGoSearch on a nearby Apache server.

The situation gets a bit trickier when you add in some access control. Let's say you want to have some private weblogs on the server. Now you can't just let the search engine index them as well, or you'll get people searching for stuff and seeing excerpts of the private material (and getting links to pages they can't open, which isn't going to make them happy).

So, what do you do? The easiest thing at this point is to install a second copy of mnoGoSearch that only indexes the public site, and to put the first engine behind the same access restrictions as the private material. Now you have two search engines: a restricted one that searches the whole site, and a public one that only searches the public material.

Things are going well, until you need to add more combinations. What if you have a few blogs you want one group of people to be able to see, and a few more that you want a different group to be able to see? What about when you want to change the permissions? You don't want to wait ages for the search engine to reindex itself in response to the change, and you need a whole search database for each combination of access controls.

Basically, this solution isn't practical for more than two or three access combinations, and I'm likely to see plenty more than that in this particular application. So, it looks like PyCS is going to get its own search engine.
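One way around the one-database-per-combination problem is to keep a single index and filter the results per user at query time. The sketch below is only an illustration of that idea, not how PyCS does it: the names visible_results, blog_of and readers are made up for the example, and it assumes each blog either has a set of allowed readers or is public.

```python
def visible_results(results, blog_of, readers, user):
    """Keep only the results that `user` is allowed to read.

    results -- list of URLs returned by the search, best first
    blog_of -- hypothetical mapping: URL -> name of the blog it belongs to
    readers -- hypothetical mapping: blog name -> set of allowed users;
               a blog with no entry is treated as public
    """
    allowed = []
    for url in results:
        acl = readers.get(blog_of[url])
        if acl is None or user in acl:
            allowed.append(url)
    return allowed


# Made-up example data: one public blog, one private blog.
blog_of = {"/public/1": "public", "/secret/1": "secret"}
readers = {"secret": {"alice"}}
hits = ["/secret/1", "/public/1"]

alice_view = visible_results(hits, blog_of, readers, "alice")
bob_view = visible_results(hits, blog_of, readers, "bob")
```

With this arrangement a permission change only touches the readers mapping, so nothing needs to be reindexed.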

Here are my design notes.

Requirements

A minimal search engine needs to do two things: index and search. Ideally, the search will be able to pull out excerpts of the text (like Google does) so you can get an idea of what you're looking at from the results page.

Indexer

The indexer needs to be able to:

- parse HTML and find visible text

- ideally figure out font size / visibility for text (to distinguish titles, etc, and rank them higher than 'fine print')

- make an index out of all the words
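The three indexer steps above can be sketched with nothing but the standard library. This is a minimal illustration, not a finished design: the class VisibleTextParser, the function index_page and the weights in TAG_WEIGHTS are all invented for the example, and using the enclosing tag as a stand-in for "font size" is an assumption.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed weights: words in titles and headings count more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2}
HIDDEN_TAGS = {"script", "style"}          # text a reader never sees
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed
WORD_RE = re.compile(r"[a-z0-9]+")


class VisibleTextParser(HTMLParser):
    """Collect (word, weight) pairs for the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.words = []   # (word, weight) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in HIDDEN_TAGS for t in self.stack):
            return
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.stack), default=1)
        for word in WORD_RE.findall(data.lower()):
            self.words.append((word, weight))


def index_page(index, url, html):
    """Add one page to the inverted index: word -> {url: score}."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    for word, weight in parser.words:
        index[word][url] = index[word].get(url, 0) + weight


index = defaultdict(dict)
index_page(
    index,
    "/a",
    "<html><head><title>Python search</title><style>p{}</style></head>"
    "<body><p>Search engines in Python</p></body></html>",
)
```

Here "python" appears once in the title and once in the body, so it scores higher for "/a" than "engines", which only appears in the body; the contents of the style block are never indexed.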

Searcher

The searcher needs to be able to:

- given a word or a combination of words, quickly find them in the index and find the most relevant pages

- find excerpts from the resulting pages around the words found

This boils down to being able to:

- look up a word in the index and extract results

- find the intersection between sets of results

- access the source HTML
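Those three operations are small enough to sketch directly. The code below assumes an index shaped as word -> {url: score} and assumes the page text is already available as a plain string; the function names search and excerpt, the scoring by summed weights and the sample data are all made up for illustration.

```python
import re


def search(index, query):
    """Return the URLs that contain every query word, best score first."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Look up each word; a word missing from the index matches nothing.
    postings = [index.get(word, {}) for word in words]
    # Intersect the result sets, starting from the smallest one.
    postings.sort(key=len)
    hits = set(postings[0])
    for posting in postings[1:]:
        hits.intersection_update(posting)
    scores = {url: sum(p[url] for p in postings) for url in hits}
    return sorted(hits, key=lambda url: (-scores[url], url))


def excerpt(text, word, radius=30):
    """Return a snippet of `text` around the first occurrence of `word`."""
    match = re.search(r"\b%s\b" % re.escape(word), text, re.IGNORECASE)
    if match is None:
        return ""
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)
    snippet = text[start:end].strip()
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + snippet + suffix


# Made-up index and page text for the example.
index = {
    "python": {"/a": 6, "/b": 1},
    "search": {"/a": 6},
    "engine": {"/a": 1, "/b": 2},
}
both = search(index, "python engine")
only_a = search(index, "Python search")
snippet = excerpt("Building a search engine in Python is fun", "engine", radius=10)
```

Cutting the excerpt at a fixed character radius can split a word in half; a real version would probably snap to word boundaries.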

That's the overall picture. On to seeing how it's done. I see that Michal Wallace (Mr. CornerHost) has started on one in Python ("Ransacker"), but it's GPL (which I don't want PyCS to be) and I'm not entirely sure that it's very stable.

Let's see how this goes ...

[continued in Adapting existing search engine code]