Phillip Pearson - web + electronics notes: Changing the Structured Blogging plugins' XML output

2006-1-12

Changing the Structured Blogging plugins' XML output

One current issue with the Structured Blogging plugins is that they produce HTML that doesn't validate on the W3C validator and feeds that produce warnings on the Feed Validator.

This is because of the method used to embed the structured post's XML source in the HTML output.

How the output looks

The current output looks like this, with the XML source for the post wrapped in a subnode element inside a script element:

<script type="application/x-subnode; charset=utf-8">
  <!-- the following is structured blog data for machine readers. -->
  <subnode alternate-for-id="sbentry_5"
      xmlns:data-view="http://www.w3.org/2003/g/data-view#"
      data-view:interpreter="http://structuredblogging.org/subnode-to-rdf-interpreter.xsl"
      xmlns="http://www.structuredblogging.org/xmlns#subnode">
    <xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
    <generator id="wpsb-1" type="x-wpsb-post" version="1"/>
    <event type="event/conference">
      <name>Doc's show</name>
      <image>/~phil/sb_latest/images/syndicate_logo.gif</image>
      <person role="organizer" url="http://doc.weblogs.com">Doc Searls</person>
      <description>This is Doc's show.  He organized it, decided what
        panels to have, and he's paying for dinner.</description>
      <tags>doc</tags>
      <begins>2005-12-13T15:57:00</begins>
      <ends>2005-12-13T15:57:00</ends>
    </event>
    </xml-structured-blog-entry>
  </subnode>
</script>

This embedding technique, called x-subnode and invented by the guys at PubSub (I think Bob Wyman and Duncan Werner) when they did the first SB plugin, is pretty clever. Because browsers don't know about the application/x-subnode script type, they completely ignore the contents. This means you don't need to enclose the whole thing in a comment to stop it from being displayed. You can then drop the whole thing into an RSS <description> or Atom <content> element and have the structured data flow out through the feed.

Other bits to note:

The alternate-for-id attribute points to an ID earlier in the page which encloses the HTML of this post. This would let a Greasemonkey script reformat the post if it wanted to - or allow a crawler to go back from the structured data to the actual HTML.

The two data-view lines (the xmlns:data-view declaration and the data-view:interpreter attribute) are there to enable GRDDL, which lets RDF people extract meaning from the XML content. This lets us be "RDF compatible" without having to actually generate the RDF.

So, in summary:

It lets you embed XML inside HTML without commenting it out.
The XML is still accessible using an XML parser, so XSLT etc works.
GRDDL tools will be able to turn it into RDF.
It works inside HTML and also inside RSS/Atom, so a separate embedding method isn't required for feeds.
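
To make the second point concrete, here is a minimal sketch of how a crawler might pull the embedded XML back out of a page. It uses only the Python standard library; the class and function names (SubnodeExtractor, extract_entries) and the cut-down SAMPLE_PAGE are made up for illustration and are not part of the SB plugins.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Namespace taken from the example output above.
SB_NS = "http://www.structuredblogging.org/xmlns"

# Hypothetical, shortened version of the plugin output shown earlier.
SAMPLE_PAGE = """<html><body>
<div id="sbentry_5"><p>Doc's show</p></div>
<script type="application/x-subnode; charset=utf-8">
  <!-- the following is structured blog data for machine readers. -->
  <subnode alternate-for-id="sbentry_5"
      xmlns="http://www.structuredblogging.org/xmlns#subnode">
    <xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
      <event type="event/conference">
        <name>Doc's show</name>
      </event>
    </xml-structured-blog-entry>
  </subnode>
</script>
</body></html>"""


class SubnodeExtractor(HTMLParser):
    """Collects the raw text of every application/x-subnode script block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a matching script

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith("application/x-subnode"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_entries(html):
    """Return the xml-structured-blog-entry elements embedded in an HTML page."""
    parser = SubnodeExtractor()
    parser.feed(html)
    parser.close()
    entries = []
    for block in parser.blocks:
        # The script body is plain XML, so a normal XML parser can read it.
        subnode = ET.fromstring(block.strip())
        entries.extend(subnode.iter("{%s}xml-structured-blog-entry" % SB_NS))
    return entries
```

Calling extract_entries(SAMPLE_PAGE) should give back one element, and from there the usual ElementTree or XSLT tooling applies.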

Problems

Unfortunately, using <script> for all this fires off warnings everywhere we go, and pretty much everyone who looks at the embedded data, whether in a web page or in a feed, has a really bad first impression. So, it's time to do something about that.

Here are my thoughts so far.

Tidying the GRDDL stuff

It seems (from reading the GRDDL Team Submission, the GRDDL profile document, and Danny Ayers' explanation on how to make microformats GRDDL-friendly) that the data-view bits needn't appear in the XML when it is embedded in HTML. If we put a profile for Structured Blogging in the HTML header like this:

<head profile="http://structuredblogging.org/profile">

... then, in the profile page, refer to the data-view profile and point to the SB XSLT file using profileTransformation, this will cause the XSLT file to be run on pages generated by the SB plugin.
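
As a rough sketch of what a GRDDL-aware agent would do with that, the following Python reads the profile URIs from a page's head and the profileTransformation links from a profile document. The ProfileScanner class, the scan helper, and the contents of SAMPLE_PROFILE are assumptions for illustration; no such profile page exists yet, and a real agent would also fetch the documents over HTTP.

```python
from html.parser import HTMLParser


class ProfileScanner(HTMLParser):
    """Records head profile URIs and rel="profileTransformation" link targets."""

    def __init__(self):
        super().__init__()
        self.profiles = []
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # The profile attribute is a space-separated list of URIs.
            self.profiles.extend((attrs.get("profile") or "").split())
        elif tag in ("a", "link"):
            rels = (attrs.get("rel") or "").split()
            if "profileTransformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def scan(html):
    """Return (profile URIs, transformation URLs) found in one document."""
    scanner = ProfileScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.profiles, scanner.transformations


# A page as the SB plugin might generate it.
SAMPLE_PAGE = """<html>
<head profile="http://structuredblogging.org/profile"><title>Blog</title></head>
<body><p>post goes here</p></body>
</html>"""

# Hypothetical profile page: it declares the data-view profile itself and
# points at the SB XSLT file with profileTransformation.
SAMPLE_PROFILE = """<html>
<head profile="http://www.w3.org/2003/g/data-view"><title>SB profile</title></head>
<body>
<p><a rel="profileTransformation"
      href="http://structuredblogging.org/subnode-to-rdf-interpreter.xsl">SB to RDF</a></p>
</body>
</html>"""
```

Here scan(SAMPLE_PAGE) yields the SB profile URI, and scan(SAMPLE_PROFILE) yields the XSLT URL that would then be run against the original page.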

Getting the XML out of the page

After setting up the GRDDL profile/transform, we could move the XML source to its own URL and define a microformat to link to it. This way an RDF crawler would still pick up on it, while crawlers specifically looking for SB posts could look for the links and work from there.

I'm not quite sure how this should look, but here's one possibility: put a class name (e.g. structured_post) on an element surrounding the post, and inside that element, link to the XML source with rel="alternate" and type="application/xml". So the HTML for a post might look like:

<div class="structured_post">
  <h3>This is the post title</h3>
  <p>Here is some text</p>
  <p>(<a rel="alternate" type="application/xml" href="/path/to/xml_source">XML</a>)</p>
</div>
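
A crawler looking for SB posts could then find the XML sources along these lines. This is only a sketch under the markup proposed above (class structured_post, rel="alternate", type="application/xml"); the SourceLinkFinder class, the find_source_links helper, and the example.org base URL are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SAMPLE_POST = """<div class="structured_post">
  <h3>This is the post title</h3>
  <p>Here is some text</p>
  <p>(<a rel="alternate" type="application/xml" href="/path/to/xml_source">XML</a>)</p>
</div>
<p><a rel="alternate" type="application/xml" href="/unrelated.xml">not a post</a></p>"""


class SourceLinkFinder(HTMLParser):
    """Collects XML source links that appear inside structured_post divs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._div_depth = 0  # div nesting depth, counted from the post's own div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._div_depth or "structured_post" in classes:
                self._div_depth += 1
        elif tag == "a" and self._div_depth:
            rels = (attrs.get("rel") or "").split()
            is_xml = attrs.get("type") == "application/xml"
            if "alternate" in rels and is_xml and attrs.get("href"):
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1


def find_source_links(html, base_url):
    finder = SourceLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links
```

Only links inside a structured_post element are collected, so an ordinary rel="alternate" link elsewhere on the page is not mistaken for a post's XML source.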

Making the XML more accessible inside feeds

Currently the whole chunk of XML (above) is embedded in the description or content elements in syndication feeds, as part of the encoded HTML. It would look a lot nicer if it could be moved out - perhaps like this:

<item>
  ...
  <description>HTML goes here</description>
  <source xmlns="http://structuredblogging.org/xmlns" url="http://server/path/to/xml_source">
    core XML -- <event> from the first example -- goes here
  </source>
</item>

We could GRDDL-enable this by putting a namespaceTransformation reference in the xmlns document.
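
For a feed parser, picking up the proposed element would be a small change. The sketch below uses Python's ElementTree against a hypothetical RSS item built from the example above; the structured_sources function and the SAMPLE_FEED contents are illustrative assumptions, and the namespace is the one used in that feed example (without the www that appears in the HTML output).

```python
import xml.etree.ElementTree as ET

# Namespace as written in the feed example above.
SB_FEED_NS = "http://structuredblogging.org/xmlns"

# Hypothetical feed using the proposed source element.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <description>HTML goes here</description>
      <source xmlns="http://structuredblogging.org/xmlns"
              url="http://server/path/to/xml_source">
        <event type="event/conference">
          <name>Doc's show</name>
        </event>
      </source>
    </item>
  </channel>
</rss>"""


def structured_sources(feed_xml):
    """Return (url, child elements) for each SB source element in a feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        # Matching on the namespace keeps this apart from RSS's own
        # un-namespaced <source> element.
        for source in item.findall("{%s}source" % SB_FEED_NS):
            results.append((source.get("url"), list(source)))
    return results
```

Each result pairs the URL of the XML source with the structured elements carried inline, so a parser can use whichever it prefers.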

Pros and cons of the changes

Making these changes would:

make everything look a lot nicer,
and make everything validate,
while maintaining RDF compatibility.

The downsides are:

the XML would no longer be directly available inside the HTML, so a crawler would have to make more HTTP requests,
the XML wouldn't be sent over to other blogs when making remote posts via outputthis.org,
and feed parsers (like the one powering PubSub) would have to be modified to understand the new syntax.

Hmm.

Perhaps the best solution would be to:

Keep publishing the XML source (using x-subnode) in the HTML (and when sending via outputthis.org),
but use profileTransformation to get the data-view attributes out of each subnode block.
Use an sb:source element to include the XML source in feeds (rather than x-subnode).

Update (2006-01-19): I've changed my mind. Linking to the XML like this - with <a rel="alternate"> - is actually more likely to be preserved when sending stuff around (with outputthis or inside a feed). The only issue is that the link doesn't look that great. Perhaps we need an "SB XML source" icon, like RSS's white-on-orange XML icon. I've seen the white-on-orange icon used to mean things other than RSS, but I'm not sure how widespread that is.
