Adapting existing search engine code

[continued from Building a search engine in Python]

[see also: posts about Blog Search]

In yesterday's article, I presented a possible plan for giving Python Community Server a reliable search facility that respects the built-in access control, by writing everything from scratch. However, this isn't the only way to achieve that goal. Another way is to write a new front-end for existing search software.

There are problems with this:

Integration: an external search engine won't be as easy to install as something built in to PyCS. For example, here are my notes from yesterday on installing mnoGoSearch on FreeBSD - a nontrivial activity.

License: most existing search engines are GPL. This isn't a big deal for individual PyCS installations, but I do want to keep PyCS free for inclusion in commercial software, and bundling a GPL search engine would get in the way of that.

However, these problems are outweighed by the good side:

Quality: other search engine packages have a development head start of many programmer-years. Thus, we can assume that they don't suck as much as something built from scratch would at first.

Speed: using a pre-existing package means we don't need to spend any time writing the blocks that aren't PyCS-specific. That's pretty much everything.

Here are some possible search engine packages that we could use:

mnoGoSearch
This is the only one I have experience with. It seems OK, although the installation I have here seems to have difficulty with displaying context. All the search results show the first few lines of the result page, but don't show anything around the text that's been found. This makes the context fairly useless for a blog server like PyCS. Here's an example of it searching its own docs.

Used by: MySQL, Debian

ht://dig
This is another popular (GPL) search engine package. I seem to recall it being installed by default in a Debian installation I used years ago, so presumably it's also reasonably mature. Here's an example of it searching its own docs.

Problem: Phrase searching has been added for the 3.2 release, which is currently in the beta phase. Hopefully it'll be out of beta soon.

Used by: KDE, ezmlm, FreeCiv, Grokking the GIMP

WebGlimpse
A web version of the classic 'glimpse' searcher? Doesn't seem too effective.

Used by: faqs.org

ZCatalog
This is the search engine used in Zope. It doesn't look like it needs to run as part of Zope, though - ZCatalog is a wrapper on top of Catalog, a standalone search engine in Python. Perhaps this is exactly what we're looking for. Catalog requires the ZODB, which is free and should be fine to include in PyCS.

It doesn't seem to have a proper homepage. Where can I download it? Notes on ZWiki. Here's the ZCatalog chapter in the Zope Book. Aha - StandaloneZCatalog! (See also StandaloneZODB / old SF home).

indexer.py
By David Mertz (Gnosis Software) - a full-text indexer in Python. Can search for multiple words in a document but doesn't do proximity search.
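To make "multiple words but no proximity search" concrete, here is a minimal sketch of that kind of index. It is not indexer.py's actual code or API. The names build_index and search_all_words are made up for illustration, and the word splitting is deliberately naive.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it.

    `docs` is assumed to be a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all_words(index, query):
    """Return the ids of documents containing every word in the query.

    Word positions are never stored, so there is no way to ask for words
    that appear next to (or near) each other.
    """
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because only document ids are stored per word, a phrase or proximity search would need the index to keep word positions as well, which is presumably why simpler indexers leave it out.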

The most interesting of the above is ZCatalog, because it's in Python and not GPL. That means we might be able to properly integrate it into PyCS. I'll see if I can set up a copy of Zope somewhere and test it out.

If that's not good enough, ht://dig is likely to be the most useful fallback, paired with a custom interface in Python that takes the search results and only displays those the user has access to.
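The filtering step in such a front-end could look roughly like this. It's only a sketch: the shape of a search hit (a dict with a "url" key), the ACL table and the can_read check are all assumptions for illustration, not anything that ht://dig or PyCS actually provides.

```python
# Hypothetical access control table: URL -> set of users allowed to read it.
# URLs that are not listed are treated as public.
ACL = {
    "/private/post1": {"alice"},
}


def can_read(user, url):
    """Hypothetical permission check; the real one would ask PyCS."""
    allowed = ACL.get(url)
    return allowed is None or user in allowed


def visible_results(results, user, can_read):
    """Keep only the hits `user` may read, preserving the engine's ranking."""
    return [hit for hit in results if can_read(user, hit["url"])]
```

One catch with filtering after the search: if the engine returns results a page at a time, a page can end up short (or empty) once the private hits are removed, so the front-end may need to fetch more results than it displays.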

Do you know of any other search engines? Drop me a line (