Back when I started blogging in 2002, several of us were running code that would download the front page of every blog when it updated. There were public sites like DayPop, Blogdex, my Blogging Ecosystem, and later on Technorati, but also one or two private crawlers.
I was wondering this morning: is it still feasible (if not necessarily sensible) to run a private crawler?
Dave Sifry's last State of the Live Web report shows posting volumes at about 1.4M/day.
This puts the lower bound on data transfer at something like 2k * 1.4M = 2.8G/day, assuming an average post size of 2k, and the existence of a magic way of retrieving posts with no overhead. If you have to download the whole blog front page each time, it could be more like 50k * 1.4M = 70G/day, or just over 2T/month. RSS/Atom feeds should be a little smaller (mine is 36k compared to a 45k index page), and if you're lucky, you'll be able to use RFC3229 delta encoding to reduce that a bit more.
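A fetch that hopes to benefit from RFC3229 might look like the sketch below: a conditional GET (If-None-Match / If-Modified-Since) plus the `A-IM: feed` instance manipulation, so a supporting server can send only the entries added since the ETag we last saw. The URL, ETag, and date here are placeholders, not real values.

```python
# Sketch of a bandwidth-conscious feed fetch: conditional GET plus the
# RFC 3229 "feed" instance manipulation (A-IM: feed). A server that
# supports it replies "226 IM Used" with only the new entries; one that
# doesn't falls back to "304 Not Modified" or a full "200 OK".
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    req = urllib.request.Request(url)
    # Ask for delta encoding per RFC 3229 + the "feed" extension
    req.add_header("A-IM", "feed")
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# Placeholder URL and validators for illustration only
req = build_conditional_request(
    "http://example.com/index.xml",
    etag='"abc123"',
    last_modified="Sat, 09 Jun 2007 00:00:00 GMT",
)
```

In the best case you pay only for the delta; in the common case the 304 costs you headers alone, which is what makes per-update polling even remotely affordable.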
So it's looking feasible. Servers from LayeredTech have a bandwidth limit of 1.5T/month, so if you have fast enough code (able to pull on average 8 megabits/sec down your pipe in the worst case) and can take advantage of streams like the Six Apart Update Stream to reduce bandwidth where possible, you might be able to crawl the entire blogosphere on a not-too-expensive server.
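The arithmetic above is easy to check. A few lines confirm the daily and monthly totals and show that the worst case works out to roughly 6.5 megabits/sec sustained, so an 8 megabit/sec average leaves some headroom for retries and bursts:

```python
# Back-of-envelope check of the crawl bandwidth estimates:
# 1.4M posts/day at 2k each (magic no-overhead fetch) vs 50k full
# front pages, and the sustained bitrate the worst case implies.
posts_per_day = 1_400_000

best_case = 2_000 * posts_per_day        # bytes/day, ~2.8 GB
worst_case = 50_000 * posts_per_day      # bytes/day, ~70 GB

per_month = worst_case * 30              # bytes/month, ~2.1 TB
mbits_per_sec = worst_case * 8 / 86_400 / 1e6  # sustained megabits/sec

print(best_case / 1e9)    # GB/day, lower bound
print(worst_case / 1e9)   # GB/day, full front pages
print(per_month / 1e12)   # TB/month, vs the 1.5T LayeredTech cap
print(mbits_per_sec)      # ~6.5 Mbit/s sustained
```

Note that 2.1T/month is already over a 1.5T/month cap, so the update streams and delta encoding aren't just nice-to-haves; you need them to get under the limit.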
Update: Ping-o-Matic's stats page reports volumes around 20M pings/day at the moment. Not sure who to believe right now. As I remember, most of the ping volume on the Topic Exchange came from spammers, so it's possible that Technorati's figure reflects real blog posts, whereas ping counts include any attempt to publicise a blog, not just actual posts.
Another datapoint: spinn3r, Tailrank's data-provider backend business, reports 25k posts per hour, which works out to 600k, or less than a million, updates/day.