Connecting ht://Dig to PyCS

2003-2-20

Here's how to connect the ht://Dig search engine to PyCS, and get a decent search function for your weblogs that respects any access controls you may have set up.

NOTE: I ran into stability problems running this on a FreeBSD server. It seemed to work fine on my development Linux box, though. You might want to test it for a while before running it on a public system.

First, note that ht://Dig is covered by the GPL, whereas PyCS is covered by the MIT license. They are compatible, but any combination will end up being covered by the GPL, so you can't make a closed source version of PyCS that includes the ht://Dig connection.

Prerequisites

How to do it

First, get the modified version of ht://Dig (more info).

rm -f htdig-pycs-snapshot-*.tar.gz
wget http://www.myelin.co.nz/pycs_search/latest.php

Unpack the snapshot.

tar -vzxf htdig-pycs-snapshot-*.tar.gz

Configure and build (change /usr/local to /usr if your Python libraries are in /usr/lib/python2.2 instead of /usr/local/lib/python2.2).

cd htdig
./configure --prefix=/usr/local --with-python=yes
make
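If you are not sure which prefix applies to the configure step above, one way to check is to ask the Python interpreter where it is installed. This is only a minimal sketch, and it assumes you run it with the same interpreter that runs PyCS; python_prefix is a name made up for illustration.

```python
import sys

def python_prefix():
    """Return the installation prefix of the running interpreter (e.g. /usr or /usr/local)."""
    return sys.prefix

if __name__ == "__main__":
    # The printed value is the candidate for ./configure --prefix=...
    print(python_prefix())
```

If it prints /usr, use --prefix=/usr; if it prints /usr/local, the command shown above should be fine as it is.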

Now install the files (you need to be root to do this).

su
make install
exit

Now patch Medusa; edit src/pycs/medusa/http_server.py and go to line 498. Change the following:

                        # r.handler = h # CYCLE
                        h.handle_request (r)
                    except:
                        self.server.exceptions.increment()
                        (file, fun, line), t, v, tbinfo = asyncore.compact_traceback()

To this:

                        # r.handler = h # CYCLE
                        h.handle_request (r)
                    except SystemExit:
                        raise
                    except:
                        self.server.exceptions.increment()
                        (file, fun, line), t, v, tbinfo = asyncore.compact_traceback()
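The effect of the extra clause is that a SystemExit raised while handling a request is passed on instead of being caught by the bare except and counted as an ordinary handler error. The stand-alone sketch below illustrates the difference. It is not Medusa code, and exiting_handler, dispatch_unpatched and dispatch_patched are hypothetical names used only for this illustration.

```python
import sys

def exiting_handler():
    # Stand-in for a request handler that asks the process to exit.
    sys.exit(3)

def dispatch_unpatched(handler):
    # Same shape as the original code: the bare except also catches SystemExit.
    try:
        handler()
    except:
        return "swallowed"
    return "ok"

def dispatch_patched(handler):
    # Same shape as the patched code: SystemExit is re-raised before the bare except.
    try:
        handler()
    except SystemExit:
        raise
    except:
        return "swallowed"
    return "ok"
```

Calling dispatch_unpatched(exiting_handler) returns normally, whereas dispatch_patched(exiting_handler) lets the SystemExit reach the caller.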

Now PyCS needs to know where you have installed the _htsearch module (it is in /usr/local/cgi-bin/ in this example). Edit etc/pycs/pycs.conf and add the following lines:

enablehtdig = yes
htsearchpath = /usr/local/cgi-bin
htsearchconf = /usr/local/etc/htdig.pycs.conf
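Before restarting, it can help to confirm that the two paths actually exist. The helper below is a rough sketch and not part of PyCS. It assumes the settings are plain "key = value" lines and that lines starting with # are comments, and it ignores everything else in the file. read_settings and missing_paths are hypothetical names.

```python
import os

def read_settings(text):
    """Collect 'key = value' pairs from config text, skipping blanks, comments and other lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: '#' starts a comment; lines without '=' are not settings.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def missing_paths(settings, keys=("htsearchpath", "htsearchconf")):
    """Return the keys that are unset or whose path does not exist on disk."""
    return [k for k in keys if k not in settings or not os.path.exists(settings[k])]

if __name__ == "__main__":
    # Path taken from the instructions above; adjust it to your PyCS install.
    with open("etc/pycs/pycs.conf") as f:
        print(missing_paths(read_settings(f.read())))
```

An empty list means both paths were found; any key that is printed is either missing from the file or points at something that is not there.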

Now you should be able to restart PyCS, and the /modules/search.py page should work. If you get an error about the search path or config path not being set up, check that the lines are in the right place in pycs.conf. Look in the var/log/pycs/error.log and etc.log files for messages that might help.

Also try running the /usr/local/cgi-bin/qtest program to make sure your ht://Dig installation is working properly. It should return some matches; "No matches" is a sign that your database isn't working or that you haven't run the crawler yet. Don't forget to edit the ht://Dig config file and run rundig to initialise the database. You might need to mkdir -p /usr/local/var/htdig before it will work.

Still having trouble? Drop me a line and I'll see what I can do; I've probably missed something out of these instructions ;-)

Related work

pysearch, by Paul Erickson. It goes further than I have with the output, and returns it as some sort of collection of Python objects. However, it doesn't run with newer versions of ht://Dig and Python.