Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2002-8-3

Q & A

It's time to answer some of the questions and requests about the ecosystem (including some I've received by email):

My link counts are wrong!

Cool - tell me and I'll fix it. I know that the counts aren't perfect; they're certainly different from TTLB's ecosystem at times. I'm continually ironing out bugs, but I need to have examples of things not working before I can find out why. So drop me a line and I'll sort it out for you!

Looks like you are missing the forward links for people using Blogrolling.com

Yes - I was :)

Not now though! The crawler will now search through your blog and look for blogrolling.com links, then follow them and pull out all the links from there. As such, people like Jim S have jumped rather markedly in the 'most prolific linkers' column!
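For the curious, here is a rough sketch of that idea in Python. It is not the crawler's actual code: the names LinkCollector, extract_links, is_blogroll_link and forward_links are made up for illustration, the caller has to supply the fetch function that downloads a page, and it assumes the blogroll comes back as plain HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all <a href> targets in a chunk of HTML, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def is_blogroll_link(url):
    """True if the URL points at blogrolling.com or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "blogrolling.com" or host.endswith(".blogrolling.com")


def forward_links(page_html, fetch):
    """Links on a blog page, with blogrolling.com links replaced by the
    links found on the blogroll they point to.

    fetch is a hypothetical caller-supplied function: URL in, HTML text out.
    """
    links = []
    for url in extract_links(page_html):
        if is_blogroll_link(url):
            links.extend(extract_links(fetch(url)))
        else:
            links.append(url)
    return links
```

Passing fetch in as a parameter keeps the sketch testable without touching the network; a real crawler would also want caching and error handling around it.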

What is gzip?

gzip is a compression program - it does more or less the same thing as WinZip, except it's not tied to Windows. It's embedded in a lot of web servers and browsers, and acts to compress the web pages as they get sent across the 'net. Basically it makes pages load faster by reducing the amount of data sent.

Right now, the ecosystem main page is 290,817 bytes long, but it compresses down to 34,347 (an 88% reduction). This means that it takes about 2 seconds for the web server to send it out rather than about 20 ;-)
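If you want to try this yourself, here is a small Python sketch using the standard gzip module. The sample page is invented (a long run of similar table rows, which is roughly what a big list of blogs looks like), and the transfer rate is simply what the "about 20 seconds" figure above implies, not something I measured.

```python
import gzip


def compression_stats(data):
    """Return (compressed size in bytes, fraction of bytes saved)."""
    compressed = gzip.compress(data)
    return len(compressed), 1 - len(compressed) / len(data)


def transfer_seconds(num_bytes, bytes_per_second):
    """Time to send num_bytes at a steady rate, ignoring latency."""
    return num_bytes / bytes_per_second


# Made-up sample: repetitive HTML compresses very well.
rows = [
    "<tr><td>blog %d</td><td>%d links</td></tr>\n" % (i, i % 97)
    for i in range(5000)
]
page = "".join(rows).encode("utf-8")
size, saved = compression_stats(page)
print("sample page: %d -> %d bytes (%.0f%% smaller)" % (len(page), size, saved * 100))

# The figures quoted above for the ecosystem main page.
original, compressed = 290817, 34347
print("ecosystem page: %.1f%% smaller" % ((1 - compressed / original) * 100))

# Assumed rate: the one implied by sending the uncompressed page in ~20 s.
rate = original / 20
print("compressed page at that rate: %.1f s" % transfer_seconds(compressed, rate))
```

How much you save depends heavily on the content: pages full of repeated markup shrink a lot, while images that are already compressed barely shrink at all.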

What is the format of the ecosystem file?

I'm using MetaKit to store the data, but I'm moving to just using the cache files (click on the 'c' next to a blog name in the ecosystem main page). The only stuff that ends up on disk is:

- the list of blogs

- a copy of all the web pages the crawler downloads (the cache pages)

- the ecosystem main page and stats pages

Unfortunately that means I don't have a nice compact summary of the data that people can download and play with. However, if anybody would like me to, I can produce something in XML or some other format that's easier to parse.

Either that or I can just give you the source code for the crawler and you can build it yourself ;-)

It would be interesting to see a ranking number on the lists as they are rather long now.

Done - everything is ranked now :)

Note that I'm only showing the top 500 in each list, as the HTML page was getting too long and my web server was having trouble. I'll put the other ones back onto some other pages when I have some free time.
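The ranking itself is nothing fancy. Something along these lines would do it; ranked_top is a made-up name, and breaking ties alphabetically is my assumption for the sketch rather than a description of what the ecosystem pages actually do.

```python
def ranked_top(counts, limit=500):
    """Return (rank, name, count) tuples for the highest counts.

    counts maps a blog name to its link count. Ties are broken
    alphabetically by name so the order is stable between runs.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, name, count)
        for rank, (name, count) in enumerate(ordered[:limit], start=1)
    ]


# Invented sample data.
sample = {"blog-a": 3, "blog-b": 5, "blog-c": 1, "blog-d": 5}
for rank, name, count in ranked_top(sample, limit=3):
    print(rank, name, count)
```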

Also we need some better graph analysis to show clusters of related sites. I'll maybe post some ideas I had on this on my site. I have only been meaning to write the paper for ten years.

Good idea - I'll be watching your site ;-)

That's been one of my goals since the start, but I haven't had the time to implement it.

Apparently TTLB has something up his sleeve, but he's not telling me yet!

Battle of the blogging tools

I added radio.userland.com, blogger.com, manila.userland.com and movabletype.org into the ecosystem, in an attempt to see how many blogs I'm scanning from each tool. Most bloggers link back to the tool they used - AFAIK this is a requirement for Blogger and Movable Type, but not for Radio.

Having a default template which links back to your tool site looks like a good way to guarantee yourself a great Google ranking; you automatically get a huge number of links: Radio (17,900), MT (21,600), Manila (30,000) and Blogger (33,400).

Google includes every archive page as well, which really pushes the link count up. My results ended up as follows:

MT: 643 links
Blogger: 534 links
Radio: 256 links
Manila: 74 links

This isn't what I expected ... a possible explanation for this is that more MT blogs than Blogger blogs ping weblogs.com, so I've ended up including more of them in here.
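Here is a rough Python sketch of this kind of tally, in case anyone wants to try it on their own data. The function names are invented, the sample data is fake, and counting each blog at most once per tool is an assumption made for the sketch, not necessarily how the numbers above were produced.

```python
from collections import Counter
from urllib.parse import urlparse

# The four tool sites mentioned above, keyed by host name.
TOOL_HOSTS = {
    "radio.userland.com": "Radio",
    "blogger.com": "Blogger",
    "manila.userland.com": "Manila",
    "movabletype.org": "MT",
}


def tool_for(url):
    """Return the tool name a URL points at, or None if it isn't a tool site."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TOOL_HOSTS.get(host)


def count_tool_links(links_by_blog):
    """Count how many blogs link to each tool site.

    links_by_blog maps a blog URL to the list of links found on it.
    A blog counts once per tool, however many of its pages link there.
    """
    counts = Counter()
    for links in links_by_blog.values():
        tools = {tool_for(url) for url in links}
        tools.discard(None)
        counts.update(tools)
    return counts


# Fake sample data.
sample = {
    "http://one.example/": ["http://www.movabletype.org/", "http://www.movabletype.org/docs"],
    "http://two.example/": ["http://www.blogger.com/"],
    "http://three.example/": ["http://movabletype.org", "http://example.org/"],
}
print(count_tool_links(sample).most_common())
```

Counting per blog rather than per page is what keeps archive pages from inflating the totals the way they do in Google's numbers.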