Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2007-8-21

What happened to the FeedMesh?

It was an excellent idea, and had quite a few supporters a while back, but it looks like the list has disappeared from groups.yahoo.com, and feedmesh.{com|net|org} aren't showing anything any more.

Did the mesh die after PubSub carked it and Yahoo! canned blo.gs? Or is it still out there, albeit less obvious than before?

If anybody knows the answer, could you drop me an e-mail? (I really need to get comments working here again...)


Personal whole-blogosphere crawlers - still feasible?

Back when I started blogging in 2002, several of us were running code that would download the front page of every blog when it updated. There were public sites like DayPop, Blogdex, my Blogging Ecosystem, and later on Technorati, but also one or two private crawlers.

I was wondering this morning: is it still feasible (if not necessarily sensible) to run a private crawler?

Dave Sifry's last State of the Live Web report shows posting volumes at about 1.4M/day.

This puts the lower bound on data transfer at something like 2k * 1.4M = 2.8G/day, assuming an average post size of 2k, and the existence of a magic way of retrieving posts with no overhead. If you have to download the whole blog front page each time, it could be more like 50k * 1.4M = 70G/day, or just over 2T/month. RSS/Atom feeds should be a little smaller (mine is 36k compared to a 45k index page), and if you're lucky, you'll be able to use RFC3229 delta encoding to reduce that a bit more.
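To make the arithmetic concrete, here's a quick back-of-envelope sketch in Python - the post volume, post size and front-page size are just the rough figures above, not measurements:

    # Back-of-envelope bandwidth estimate for a whole-blogosphere crawler.
    # All inputs are the rough figures from this post, not measured values.
    POSTS_PER_DAY = 1.4e6   # Technorati's State of the Live Web number
    POST_SIZE = 2e3         # ~2k per post: the zero-overhead lower bound
    PAGE_SIZE = 50e3        # ~50k per front page: re-fetching everything

    lower = POSTS_PER_DAY * POST_SIZE   # bytes/day if you could grab bare posts
    upper = POSTS_PER_DAY * PAGE_SIZE   # bytes/day re-fetching whole front pages

    print("lower bound: %.1f GB/day" % (lower / 1e9))         # 2.8 GB/day
    print("upper bound: %.1f GB/day" % (upper / 1e9))         # 70.0 GB/day
    print("worst case: %.1f TB/month" % (upper * 30 / 1e12))  # 2.1 TB/month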

So it's looking feasible. Servers from LayeredTech have a bandwidth limit of 1.5T/month, so if you have fast enough code (able to pull on average 8 megabits/sec down your pipe in the worst case) and can take advantage of streams like the Six Apart Update Stream to reduce bandwidth where possible, you might be able to crawl the entire blogosphere on a not-too-expensive server.
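The bandwidth-saving tricks mentioned above mostly come down to standard HTTP: conditional GET plus the RFC3229 "feed" instance-manipulation. Here's a minimal sketch of a single fetch, assuming you've kept the ETag from the last run (the feed URL is just a placeholder):

    import urllib.error
    import urllib.request

    def fetch_feed_delta(url, last_etag=None):
        """Fetch a feed, asking the server for only what's changed.

        Sends a conditional GET (If-None-Match) plus the RFC3229
        "A-IM: feed" header; a server that supports delta encoding
        replies 226 IM Used with just the new entries, others fall
        back to 200 with the full feed, and 304 means nothing new.
        """
        req = urllib.request.Request(url)
        req.add_header("A-IM", "feed")
        if last_etag:
            req.add_header("If-None-Match", last_etag)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None, last_etag   # nothing changed since last fetch
            raise
        return resp.read(), resp.headers.get("ETag", last_etag)

    # body, etag = fetch_feed_delta("http://example.com/atom.xml")

Whether a given host actually honours A-IM: feed varies a lot; when it doesn't, you still get the normal conditional-GET savings.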

Update: Ping-o-Matic's stats page reports volumes of around 20M pings/day at the moment. I'm not sure who to believe right now - as I remember, most of the ping volume on the Topic Exchange came from spammers, so it's possible that Technorati's figures reflect real blog posts, whereas ping counts include every attempt to publicise a blog, not just actual posts.

Another datapoint: spinn3r, the data-provider business behind Tailrank, reports 25k posts per hour, which works out to less than a million updates/day.
