Dave:
"Interesting. I thought it would show up because I (and others) link to it from my blogroll. I guess the crawl you do is only one level deep? How often do you read the weblogs.com XML file?"
You're right: the crawl is only one level deep. Here are the details:
I've got one big text file which lists all the pages to download and specifies titles for them. The crawler (a
Python script) reads in that text file and drops everything in a big hash table, munging the URLs a bit so that duplicates collapse to one entry (so that http://scripting.com, http://scripting.com/ and http://www.scripting.com/ all count as
the same site, for example).
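The munging is roughly this sort of normalization (a hypothetical sketch of the idea, not the actual script's rules):

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Collapse trivial variants of a weblog URL to one hash key.

    Sketch: lowercase the host, strip a leading "www.", and give a
    bare hostname a trailing slash.  The real crawler's rules may differ.
    """
    scheme, netloc, path, query, fragment = urlsplit(url.strip())
    host = netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if path == "":
        path = "/"
    return "%s://%s%s" % (scheme, host, path)

# http://scripting.com, http://scripting.com/ and http://www.scripting.com/
# all normalize to the same key.
```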
Now it runs through all the URLs, downloading them if it doesn't already have a cached copy, processing all the HTML, downloading
blogrolling.com data if necessary, and tallying up all the links.
Once it's downloaded everything, it spits out
ecosystem/index.html, and all the stats pages.
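The tallying step works along these lines (a sketch of the idea only; the actual crawler's parsing is more involved, and whether it counts repeat links from one page is my assumption):

```python
import re
from collections import defaultdict

# Naive href extractor; good enough to show the tallying idea.
LINK_RE = re.compile(r'href="(http://[^"]+)"', re.IGNORECASE)

def tally_links(pages, normalize):
    """Count how many crawled pages link to each URL, counting a
    linking page once even if it repeats the same link (sketch)."""
    counts = defaultdict(int)
    for html in pages:
        seen = set()
        for url in LINK_RE.findall(html):
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                counts[key] += 1
    return counts
```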
Every night (or morning, or whenever I feel like it) I run a
bash script that pulls down
weblogs.com/changes.xml and
rcs.salon.com/weblogUpdates/changes.xml, runs a
Perl script that converts the blogs found within into the same format as my blog list file, then passes the output through
another Perl script that removes all entries that are already in the master blog list. I then
cat the output from all that onto the end of the blog list.
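In Python, the convert-then-dedupe stage might look something like this (just a sketch: the real scripts are Perl, and I'm guessing at the blog-list format as url/title pairs):

```python
from xml.dom.minidom import parseString

def changes_to_entries(xml_text):
    """Pull (url, title) pairs out of a weblogs.com-style changes.xml,
    which lists <weblog name="..." url="..."/> elements."""
    doc = parseString(xml_text)
    return [(w.getAttribute("url"), w.getAttribute("name"))
            for w in doc.getElementsByTagName("weblog")]

def not_yet_listed(entries, master_urls):
    """Drop entries whose URL is already in the master blog list."""
    return [(url, title) for url, title in entries if url not in master_urls]
```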
I have a Python script that gives me some info on the cache:
phil@icicle:~/crawler$ ./tallydates.py
August 08 - 1159
August 09 - 1304
August 10 - 1044
August 11 - 752
August 12 - 892
August 13 - 635
Total: 5786
There is also a bash script that removes the oldest 100 entries from the cache. I run this a few times before starting a crawl, to guarantee that it will refresh at least a few blog pages each time. I'm aiming to keep the whole cache younger than 7 days; as you can see from the above numbers, the earliest pages I have are from August 8.
Did I cover everything? :)