: "Interesting. I thought it would show up because I (and others) link to it from my blogroll. I guess the crawl you do is only one level deep? How often do you read the weblogs.com XML file?"
You're right: the crawl is only one level deep. Here are the details:
I've got one big text file which lists all the pages to download and specifies titles for them. The crawler (a Python script) reads that text file in and drops everything into a big hash table, munging the URLs a bit to detect duplicates (so that http://scripting.com, http://scripting.com/ and http://www.scripting.com/ all register as the same site, for example).
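The munging could look roughly like this — a minimal sketch, assuming the canonical form drops "www.", trailing slashes, and case differences (the function name and exact rules are my own, not necessarily what the real crawler does):

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Munge a URL into a canonical hash-table key so near-duplicates
    collide: lowercase it, strip a leading "www." and any trailing slash."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

# All three spellings of Scripting News collapse to one key:
keys = {normalize_url(u) for u in [
    "http://scripting.com",
    "http://scripting.com/",
    "http://www.scripting.com/",
]}
```

With that in place, the hash table naturally merges the duplicates before any tallying happens.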
Now it runs through all the URLs, downloading them if it doesn't already have a cached copy, processing all the HTML, downloading blogrolling.com data if necessary, and tallying up all the links.
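The link-tallying part of the HTML processing could be sketched with Python's standard HTML parser (the class name is my own; the real script may do this quite differently):

```python
from html.parser import HTMLParser
from collections import Counter

class LinkTally(HTMLParser):
    """Count the href targets of <a> tags in a page, so that inbound
    links to each blog can be totted up across the whole crawl."""
    def __init__(self):
        super().__init__()
        self.links = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links[value] += 1

tally = LinkTally()
tally.feed('<p><a href="http://scripting.com/">Scripting News</a> '
           'and <a href="http://scripting.com/">again</a></p>')
```

Summing these counters over every cached page gives the raw numbers the stats pages are built from.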
Once it's downloaded everything, it spits out ecosystem/index.html and all the stats pages.
Every night (or morning, or whenever I feel like it) I run a bash script that pulls down weblogs.com/changes.xml and runs a Perl script that converts the blogs found within into the same format as my blog list file. The output then goes through another Perl script that removes any entries already in the master blog list, and I cat what's left onto the end of the blog list.
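The filtering step — keep only entries not already in the master list — is done by Perl in the real pipeline, but the same idea sketched in Python (assuming, purely for illustration, a one-entry-per-line "URL<tab>Title" format) looks like:

```python
def new_entries(master_path, candidates_path):
    """Return lines from the candidates file whose URL doesn't already
    appear in the master blog list. Assumes "URL<tab>Title" lines; the
    real list format may differ."""
    with open(master_path) as f:
        known = {line.split("\t", 1)[0].strip() for line in f if line.strip()}
    fresh = []
    with open(candidates_path) as f:
        for line in f:
            url = line.split("\t", 1)[0].strip()
            if url and url not in known:
                fresh.append(line.rstrip("\n"))
    return fresh
```

Whatever this returns is what gets catted onto the end of the master list.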
I have a Python script that gives me some info on the cache — how many cached pages date from each day:
August 08 - 1159
August 09 - 1304
August 10 - 1044
August 11 - 752
August 12 - 892
August 13 - 635
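A per-day breakdown like that can be produced by bucketing cached files on their modification time — a sketch of the idea, assuming one file per cached page (the function name and cache layout are my guesses):

```python
import os
import time
from collections import Counter

def cache_ages(cache_dir):
    """Count cached pages by the day they were fetched, using each
    file's modification time as a stand-in for its fetch date."""
    counts = Counter()
    for name in os.listdir(cache_dir):
        mtime = os.path.getmtime(os.path.join(cache_dir, name))
        counts[time.strftime("%B %d", time.localtime(mtime))] += 1
    return counts
```

Printing the counter sorted by day gives exactly the kind of "August 08 - 1159" listing above.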
There is also a bash script that removes the oldest 100 entries from the cache. I run this a few times before starting a crawl, to guarantee that it will refresh at least a few blog pages each time. I'm aiming to keep the whole cache younger than 7 days; as you can see from the above numbers, the earliest pages I have are from August 8.
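The expiry script is bash in reality, but the "drop the oldest 100" step translated into a Python sketch (again assuming one file per cached page, ordered by modification time) would be:

```python
import os

def expire_oldest(cache_dir, n=100):
    """Delete the n oldest files in the cache, oldest-first by
    modification time, so the next crawl has to re-fetch those pages."""
    paths = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    paths.sort(key=os.path.getmtime)
    for path in paths[:n]:
        os.remove(path)
```

Running it a few times before a crawl guarantees at least a few hundred pages get refreshed each pass.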
Did I cover everything? :)