Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2003-10-3

Topics: flat or hierarchical?

Robert Barksdale asks, re the Topic Exchange:

> Has anyone asked about creating sub-topics under the different topics?
> I was thinking about sub-topics under Mozilla like XUL, Gecko DOM, RDF,
> and Javascript. Is that possible?


That's not possible at the moment, but I'd argue that it's not necessary. Are those really sub-topics of Mozilla, or are they standalone topics? I'd say that XUL, RDF and JavaScript aren't tied to Mozilla, so you could make them root-level topics. The Gecko DOM is a better fit, but even then, the Gecko engine runs outside Mozilla too, so you could call it a root-level topic.

I lean towards the idea of considering all topics as equal when created, and categorising later, rather than trying to fit everything into one scheme at the start and sticking with it. And everyone categorises things differently, anyway. Coming from a programming background, I'd put XUL and RDF under "xml formats" and JavaScript under "programming languages", but Robert, as a user, sees them as sub-topics of Mozilla.

So go ahead - create topics as you like, and see how they turn out! These are still early days for the Exchange -- experiment, and see what works best :-)

By the way

I'm experimenting with a new linkblog: Crash.

Here's today's archive.

Post-parsing thoughts

OK, now I've shown how to parse RSS-Data and how to parse namespaced RSS extensions in Python. I'm still not convinced that it's a good idea. Parsing the RSS-Data was just barely easier than parsing the straight XML, and only because xmlrpclib's return values are more familiar to me than elementtree's tree objects. I'd argue that the time spent hooking together the XML-RPC parser and the RSS parser in your language of choice would be better spent writing an XML library like elementtree.
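To put the comparison in concrete terms, here's roughly what the two results look like once parsing is done; "book" refers to the parsed result in each of the two scripts further down:

# RSS-Data route: xmlrpclib unmarshals the sdl:data block straight into a dict
print book['ProductName'].strip()
print book['ListPrice']

# namespaced-extension route: elementtree gives you a tree to navigate
print book.find('ProductName').text.strip()
print book.find('ListPrice').text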

There's a good discussion going on re my original post.

---

Roger Benningfield claims that the RSS-Data version is more useful by default than the straight XML version, but I'd dispute that. For example, here are two bits of XML that I'd assert are equally useful without any specific support from an aggregator:

1. RSS-Data

<x:container xmlns:x="http://www.myelin.co.nz/ns/x">
 <sdl:struct>
  <sdl:member>
   <sdl:name>foo</sdl:name>
   <sdl:value><sdl:string>bar</sdl:string></sdl:value>
  </sdl:member>
 </sdl:struct>
</x:container>

This deserialises to {'foo': 'bar'} in Python, and we know that instructions for understanding this information are at http://www.myelin.co.nz/ns/x.
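If you want to check that, here's a quick sketch of the deserialisation step, using the same xmlrpclib trick as the full script further down (the x: and sdl: prefixes are stripped first so that xmlrpclib's parser will accept the elements):

import xmlrpclib

# the snippet above, with the namespace prefixes already stripped
xml = """<container>
 <struct>
  <member>
   <name>foo</name>
   <value><string>bar</string></value>
  </member>
 </struct>
</container>"""

p, u = xmlrpclib.getparser()
p.feed(xml)
p.close()
print u._stack[0]   # prints {'foo': 'bar'}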

2. Straight XML

<x:container xmlns:x="http://www.myelin.co.nz/ns/x">
 <x:foo>bar</x:foo>
</x:container>

This parses into an elementtree which is equivalent to {'foo': 'bar'}, and we still know that instructions for understanding this information are at http://www.myelin.co.nz/ns/x.
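And a small sketch of that side, this time with elementtree; the {...} prefix is how elementtree spells namespace-qualified tag names:

from elementtree import ElementTree as et

xml = """<x:container xmlns:x="http://www.myelin.co.nz/ns/x">
 <x:foo>bar</x:foo>
</x:container>"""

NS = '{http://www.myelin.co.nz/ns/x}'
container = et.XML(xml)

# build a dict from the child element names (minus the namespace) and their text
print dict([(child.tag.replace(NS, ''), child.text) for child in container])
# prints {'foo': 'bar'}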

---

Georg Bauer suggests a sensible application for RSS-Data: using the RSS feed to push generic data (that the RSS feed generator doesn't understand) back to a periodically-connected client (like Radio, or Georg's PyDS) to process.

This sounds very sensible -- in this case, you can't use a well-specified XML encoding, because the RSS writer doesn't have a clue what it's writing.

When you're talking about well-specified data, however -- say, recipes, reviews, events, or music -- there's no need for a general-purpose encoding like this.
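To make Georg's use case concrete, here's a minimal sketch of what the client side might look like; the handler registry and function names are hypothetical, not anything Radio or PyDS actually implements:

import xmlrpclib

# hypothetical registry mapping a namespace URI to a handler function
handlers = {}

def register_handler(ns, func):
    handlers[ns] = func

def dispatch_payload(ns, sdl_data_xml):
    # unmarshal the item's sdl:data payload into plain Python objects and
    # hand it to whichever tool registered for that namespace; the feed
    # generator never had to understand the payload itself
    p, u = xmlrpclib.getparser()
    p.feed(sdl_data_xml.replace('sdl:', ''))
    p.close()
    payload = u._stack[0]
    if ns in handlers:
        handlers[ns](payload)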

Parsing namespaced RSS extensions

OK, I've shown you how to parse RSS-Data. Now let's parse Les Orchard's other example: a namespaced RSS extension. Here's the code:

import re, urllib
from elementtree import ElementTree as et

# read les's example
html = urllib.urlopen('http://www.decafbad.com/blog/tech/rss_data_versus_namespace.html').read()

# turn the html-quoted example back into xml
for entity, char in (('lt', '<'), ('gt', '>'), ('amp', '&')):
    html = html.replace('&%s;' % entity, char)

# rip out the Amazon bit and get rid of the namespace
xml = re.search(r'(\<az\:ProductInfo\>.*\</az\:ProductInfo\>)', html, re.S).group(1)
xml = '<?xml version="1.0" encoding="iso-8859-1"?>' + xml.replace('az:', '')

book = et.XML(xml).find('Details')

# and we have the data!
et.dump(book)

# give out some details (encode to latin-1 because one author's name
# contains a non-ASCII character)
authors = book.find('Authors').findall('Author')
print ("---\n%s, by %s" % (book.find('ProductName').text.strip(),
                           " and ".join([auth.text for auth in authors]),)
       ).encode('latin-1')
print "\nList %s, Amazon %s (Used %s)" % tuple(
    [book.find(x).text for x in ('ListPrice', 'OurPrice', 'UsedPrice')])
print "URL: %s" % book.get('url')


And here are the results:

phil@icefloe:~/projects/rss-data$ python parse-les-orchard-ns-example.py
<Details url="http://www.amazon.com/exec/obidos/ASIN/0439139597/0xdecafbad-20">
          <Asin>0439139597</Asin>
          <ProductName>Harry Potter and the Goblet of Fire (Book 4)</ProductName>
          <Catalog>Book</Catalog>
          <Authors>
            <Author>J. K. Rowling</Author>
            <Author>Mary GrandPr&#142;</Author>
          </Authors>
          <ReleaseDate>08 July, 2000</ReleaseDate>
          <Manufacturer>Scholastic</Manufacturer>
          <ImageUrlSmall>http://images.amazon.com/images/P/0439139597.01.THUMBZZZ.jpg</ImageUrlSmall>
          <ImageUrlMedium>http://images.amazon.com/images/P/0439139597.01.MZZZZZZZ.jpg</ImageUrlMedium>
          <ImageUrlLarge>http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg</ImageUrlLarge>
          <Availability>Usually ships within 24 hours</Availability>
          <ListPrice>$25.95</ListPrice>
          <OurPrice>$18.16</OurPrice>
          <UsedPrice>$3.97</UsedPrice>
        </Details>
      ---
Harry Potter and the Goblet of Fire (Book 4), by J. K. Rowling and Mary GrandPr▒

List $25.95, Amazon $18.16 (Used $3.97)
URL: http://www.amazon.com/exec/obidos/ASIN/0439139597/0xdecafbad-20


Parsing RSS-Data

As a companion to Les Orchard's RSS-Data versus namespace examples, here's some Python code that will parse the RSS-Data version:

import re, urllib, xmlrpclib
from pprint import pprint

# read les's example
html = urllib.urlopen('http://www.decafbad.com/blog/tech/rss_data_versus_namespace.html').read()

# turn the html-quoted example back into xml
for entity, char in (('lt', '<'), ('gt', '>'), ('amp', '&')):
    html = html.replace('&%s;' % entity, char)

# rip out the rss-data bit and get rid of the namespace
xml = re.search(r'(\<sdl\:data\>.*\</sdl\:data\>)', html, re.S).group(1)
xml = xml.replace('sdl:', '')

# feed it through xmlrpclib's parser; there's no <methodResponse> wrapper
# here, so peek at the unmarshaller's internal stack instead of calling u.close()
p, u = xmlrpclib.getparser()
p.feed(xml)
p.close()
book = u._stack[0]

# and we have the data!
pprint(book)

# give out some details
print "---\n%s, by %s" % (book['ProductName'].strip(), " and ".join(book['Authors']))
print "\nList %(ListPrice)s, Amazon %(OurPrice)s (Used %(UsedPrice)s)\nURL: %(url)s" % book


Here's what you get when you run it:

phil@icefloe:~/projects/rss-data$ python parse-les-orchard-sdl-example.py
{'Asin': '0439139597',
 'Authors': ['J. K. Rowling', 'Mary GrandPr'],
 'Availability': 'Usually ships within 24 hours',
 'Catalog': 'Book',
 'ImageUrlLarge': 'http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg',
 'ImageUrlMedium': 'http://images.amazon.com/images/P/0439139597.01.MZZZZZZZ.jpg',
 'ImageUrlSmall': 'http://images.amazon.com/images/P/0439139597.01.THUMBZZZ.jpg',
 'ListPrice': '$25.95',
 'Manufacturer': 'Scholastic',
 'OurPrice': '$18.16',
 'ProductName': '\n Harry Potter and the Goblet of Fire (Book 4)\n ',
 'ReleaseDate': <DateTime 2000-07-08T00:00:00 at 818012c>,
 'UsedPrice': '$3.97',
 'url': 'http://www.amazon.com/exec/obidos/ASIN/0439139597/0xdecafbad-20'}
---
Harry Potter and the Goblet of Fire (Book 4), by J. K. Rowling and Mary GrandPr

List $25.95, Amazon $18.16 (Used $3.97)
URL: http://www.amazon.com/exec/obidos/ASIN/0439139597/0xdecafbad-20
