Phillip Pearson - Second p0st

tech notes and web hackery from the guy that brought you bzero, python community server, the blogging ecosystem, the new zealand coffee review and the internet topic exchange

2003-3-13

Python: How to fork and return text from the child process

Nothing original here, but there are a couple of quirks, so I'm going to write it up in case anyone else is looking for how to do this. The process is the same as in C.

Updated 2003-4-10: added in waitpid call.

Why?

Sometimes, inside an application, for some reason you don't want to run something in-process. Maybe you don't trust the code you're running and want to use ulimit to restrict it from allocating too much memory or opening too many files, or maybe it Just Doesn't Work in-process.
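For the ulimit case, here is a rough sketch of what restricting the child could look like, using Python's resource module (the in-process equivalent of ulimit). The names run_with_fd_limit and greedy are made up for this example, and greedy is just a stand-in for code you don't trust. It assumes a Unix system where os.fork() and the resource module are available.

```python
import os
import resource


def run_with_fd_limit(work, max_open_files):
    """Fork, cap the child's open-file limit, run work() there.

    Returns True if work() finished without raising, False otherwise.
    """
    pid = os.fork()
    if pid == 0:
        # child: lower the soft limit, leave the hard limit as it was
        status = 1
        try:
            _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            resource.setrlimit(resource.RLIMIT_NOFILE, (max_open_files, hard))
            work()
            status = 0
        except BaseException:
            status = 1
        finally:
            os._exit(status)  # leave immediately, without running parent cleanup code
    # parent: wait for the child and look at how it exited
    _, raw_status = os.waitpid(pid, 0)
    return os.WIFEXITED(raw_status) and os.WEXITSTATUS(raw_status) == 0


def greedy():
    # hypothetical misbehaving code: tries to hold 200 files open at once
    handles = [open(os.devnull) for _ in range(200)]
    return len(handles)


print("greedy survived:", run_with_fd_limit(greedy, 32))
```

The limit only applies to the child, so the parent keeps whatever limits it started with.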

In my case, I'm trying to run htsearch inside a single-threaded (single-process) web server. htsearch is accustomed to being run as a CGI, which means it leaves all sorts of junk lying around to be cleaned up by the OS once it's done. If I don't use this hack to force the OS to clean up after it, I'll end up with one hell of a memory leak.

How it works

Basically, you need to split your main process in two, execute some code in the child process (the new process that you will be creating) and return some results back to the parent process (the main process).

The mechanism of transferring the results is called a pipe. The os.pipe() function creates a pipe. A pipe consists of two ends - one for reading, and one for writing. The child will keep the write end, and the parent will keep the read end. The child will write its results into the write end, then close it (as it exits), and the parent will read them out of the read end.
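To get a feel for the two ends before any forking happens, here is a small sketch that pushes some bytes through a pipe inside a single process. The message text is arbitrary; the only calls used are os.pipe(), os.write(), os.read() and os.close().

```python
import os

r, w = os.pipe()   # two integer file descriptors: read end, write end
os.write(w, b"hello through the pipe")
os.close(w)        # closing the only write end lets the reader see end-of-file

chunks = []
while True:
    chunk = os.read(r, 1024)
    if not chunk:  # empty bytes means end-of-file
        break
    chunks.append(chunk)
os.close(r)

text = b"".join(chunks).decode()
print(text)
```

This only works in one process because the message is small enough to sit in the pipe's buffer; with a lot of data the write would block until something reads it.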

You split the process with os.fork(). This also duplicates all file descriptors, i.e. now the parent and the child both have copies of both the read and the write ends of the pipe. A reader only sees end-of-file once every copy of the write end has been closed, so you need to close (os.close()) the write end on the parent side after forking. Otherwise the parent's read will never finish.

So, the process is:

- create pipe
- fork

child:
- close read end of the pipe
- do work
- write results to write end of the pipe
- die silently

parent:
- close write end of the pipe
- suck on read end of the pipe until the child dies and it closes
- call waitpid so the OS can reap the child once it has exited (on FreeBSD it will hang around forever as a zombie if you don't do this)
- process output

The code

#!/usr/bin/env python

import os, sys

print("I'm going to fork now - the child will write something to a pipe, and the parent will read it back")
sys.stdout.flush() # flush before forking so the child doesn't repeat buffered output

r, w = os.pipe() # these are file descriptors, not file objects

pid = os.fork()
if pid:
    # we are the parent
    os.close(w) # use os.close() to close a file descriptor
    r = os.fdopen(r) # turn r into a file object
    print("parent: reading")
    txt = r.read() # blocks until the child closes its write end
    os.waitpid(pid, 0) # make sure the child process gets cleaned up
else:
    # we are the child
    os.close(r)
    w = os.fdopen(w, 'w')
    print("child: writing")
    w.write("here's some text from the child")
    w.close()
    print("child: closing")
    sys.exit(0)

print("parent: got it; text =", txt)

Dan Sugalski has a blog

I'm probably the last one to realise this, but Dan Sugalski (of Perl fame) has a blog. Nice.

PyCS has better logs now

Lots of PyCS development going on right now, from the two of us on opposite sides of the world. I'm hacking on the search engine integration here, and Georg has been working on improving the log file format so he can run webalizer on his logs. Looks like it's going nicely. I'll update the code on pycs.net sometime, and then we can have nicer stats too!

(BTW why am I writing this sort of thing here? Isn't it meant to go in the devlog? :-)

ht://Dig static member variables

It looks like the htsearch module, which is what I Pythonized to get the screenshot below, isn't happy about running multiple times per session. I'm guessing it's because there are static member variables (i.e. HtConfiguration::_config) that are initialised on startup, so when I call main() more than once, it gets confused. Presumably the code is relying on the OS cleaning everything up automatically when it's done.

Here are the ones that look like they could be a problem:

../htdig/htcommon/HtConfiguration.h: static HtConfiguration* _config;
../htdig/htsearch/ExactWordQuery.h: static WordSearcher *searcher;
../htdig/htsearch/Query.h: static QueryCache *cache;
../htdig/htsearch/QueryParser.h: static FuzzyExpander *expander;
../htdig/htsearch/ResultMatch.h: static SortType mySortType;
../htdig/htsearch/VolatileCache.h: static ResultList * const empty;
../htdig/htword/WordDBInfo.h: static WordDBInfo* instance;
../htdig/htword/WordKeyInfo.h: static WordKeyInfo* instance;
../htdig/htword/WordMonitor.h: static char* values_names[WORD_MONITOR_VALUES_SIZE];
../htdig/htword/WordMonitor.h: static WordMonitor* instance;
../htdig/htword/WordRecordInfo.h: static WordRecordInfo* instance;
../htdig/htword/WordStat.h: static WordReference* word_stat_last;
../htdig/htword/WordType.h: static WordType* instance;


An alternative would be to fork a new process, import _htsearch there, pass the results back through a pipe, then blow away the new process. That would be fairly safe, but pretty slow. But then, I guess the normal ht://Dig searcher (htsearch) isn't that quick anyway, as it runs via CGI, so maybe it doesn't matter.
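A rough sketch of that fork-and-throw-away approach, wrapped up as a helper. The names run_in_child and fake_search are made up for this example; fake_search just stands in for whatever would import _htsearch and run the query, and is not the real module's interface. It assumes the work function returns a string.

```python
import os


def run_in_child(work):
    """Run work() in a forked child and return the string it produced."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # child: do the work, send the result up the pipe, then vanish
        status = 1
        try:
            os.close(r)
            with os.fdopen(w, "w") as out:
                out.write(work())
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup handlers
    # parent: drop our copy of the write end so read() can reach end-of-file
    os.close(w)
    with os.fdopen(r) as inp:
        text = inp.read()
    os.waitpid(pid, 0)  # reap the child so it doesn't linger as a zombie
    return text


def fake_search():
    # hypothetical stand-in for the real search call
    return "3 results for 'python' (found by pid %d)" % os.getpid()


result = run_in_child(fake_search)
print(result)
```

Whatever statics the work function initialises live and die with the child, so the parent never sees them. If the child blows up, the parent just gets back an empty string.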

htdig integration

OK, now I can search again (second try, after throwing away the code I wrote weeks ago and working from ht://Dig CVS instead). Time to get it going inside PyCS.

Anyone reading this worked with libtool, and can tell me how to get it to generate Python modules that reference shared libs but don't force me to set LD_LIBRARY_PATH before running them? Is that a setting in the shared lib or in the app linking to it? I don't know enough about how libraries work on Unix (Linux + FreeBSD in this case) yet to figure this one out.

BTW PyCS runs very nicely on an Athlon XP 2000+ :-)

Results:

[Picture of a browser window showing the search script running]

Now to get the button images to show, unwanted files (rss.xml etc) to be filtered out, and to clean up the results ... and the PyCS search engine will be working!