Dave:
"Interesting. I thought it would show up because I (and others) link to it from my blogroll. I guess the crawl you do is only one level deep? How often do you read the weblogs.com XML file?"
You're right: the crawl is only one level deep. Here are the details:
I've got one big text file which lists all the pages to download and specifies titles for them. The crawler (a
Python script) reads in that text file and drops everything in a big hash table, munging the URLs a bit so that duplicates collapse to one entry (so that http://scripting.com, http://scripting.com/ and http://www.scripting.com/ all count as
the same site, for example).
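The munging is roughly this sort of normalization (a hypothetical sketch of the idea, not the actual script's rules):

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Collapse trivial variants of a weblog URL to one hash key.

    Sketch: lowercase the host, strip a leading "www.", and give a
    bare hostname a trailing slash.  The real crawler's rules may differ.
    """
    scheme, netloc, path, query, fragment = urlsplit(url.strip())
    host = netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if path == "":
        path = "/"
    return "%s://%s%s" % (scheme, host, path)

# http://scripting.com, http://scripting.com/ and http://www.scripting.com/
# all normalize to the same key.
```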
Now it runs through all the URLs, downloading them if it doesn't already have a cached copy, processing all the HTML, downloading
blogrolling.com data if necessary, and tallying up all the links.
Once it's downloaded everything, it spits out
ecosystem/index.html, and all the stats pages.
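The tallying step works along these lines (a sketch of the idea only; the actual crawler's parsing is more involved, and whether it counts repeat links from one page is my assumption):

```python
import re
from collections import defaultdict

# Naive href extractor; good enough to show the tallying idea.
LINK_RE = re.compile(r'href="(http://[^"]+)"', re.IGNORECASE)

def tally_links(pages, normalize):
    """Count how many crawled pages link to each URL, counting a
    linking page once even if it repeats the same link (sketch)."""
    counts = defaultdict(int)
    for html in pages:
        seen = set()
        for url in LINK_RE.findall(html):
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                counts[key] += 1
    return counts
```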
Every night (or morning, or whenever I feel like it) I run a
bash script that pulls down
weblogs.com/changes.xml and
rcs.salon.com/weblogUpdates/changes.xml, runs a
Perl script that converts the blogs found within into the same format as my blog list file, then passes the output through
another Perl script that removes all entries that are already in the master blog list. I then
cat the output from all that onto the end of the blog list.
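In Python, the convert-then-dedupe stage might look something like this (just a sketch: the real scripts are Perl, and I'm guessing at the blog-list format as url/title pairs):

```python
from xml.dom.minidom import parseString

def changes_to_entries(xml_text):
    """Pull (url, title) pairs out of a weblogs.com-style changes.xml,
    which lists <weblog name="..." url="..."/> elements."""
    doc = parseString(xml_text)
    return [(w.getAttribute("url"), w.getAttribute("name"))
            for w in doc.getElementsByTagName("weblog")]

def not_yet_listed(entries, master_urls):
    """Drop entries whose URL is already in the master blog list."""
    return [(url, title) for url, title in entries if url not in master_urls]
```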
I have a Python script that gives me some info on the cache:
phil@icicle:~/crawler$ ./tallydates.py
August 08 - 1159
August 09 - 1304
August 10 - 1044
August 11 - 752
August 12 - 892
August 13 - 635
Total: 5786
There is also a bash script that removes the oldest 100 entries from the cache. I run this a few times before starting a crawl, to guarantee that it will refresh at least a few blog pages each time. I'm aiming to keep the whole cache younger than 7 days; as you can see from the above numbers, the earliest pages I have are from August 8.
Did I cover everything? :)