Phillip Pearson - web + electronics notes

tech notes and web hackery from a new zealander who was vaguely useful on the web back in 2002 (see: python community server, the blogging ecosystem, the new zealand coffee review, the internet topic exchange).

2006-5-24

Encoding of non-ASCII characters in URLs

Today I've been subjecting the PeopleAggregator API implementation to the 'Sam Ruby Iñtërnâtiônàlizætiøn test'. It went in and out just fine through XML-RPC, but the REST methods caused a bit more trouble. All sorted out now, but...

It turns out that Firefox, at least on my dev machine, encodes URLs as ISO-8859-1 (or perhaps Windows-1252), whereas Internet Explorer encodes them as UTF-8. I was trying to use PHP's mb_convert_encoding function to convert this, but it was just ignoring any non-ASCII chars.
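To make the difference concrete, here's a small Python sketch (not the PHP that PeopleAggregator runs) showing how the same test string looks when percent-encoded each way. The variable names are just for illustration, and which browser emits which form will depend on the browser version and its settings.

```python
from urllib.parse import quote

# The test string from Sam Ruby's internationalization test.
text = "Iñtërnâtiônàlizætiøn"

# Percent-encode the UTF-8 bytes: each accented character here becomes
# two bytes, so two %XX escapes.
utf8_form = quote(text, encoding="utf-8")

# Percent-encode the ISO-8859-1 bytes: each accented character is one byte.
latin1_form = quote(text, encoding="iso-8859-1")

print(utf8_form)    # I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(latin1_form)  # I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
```

Both forms are plain ASCII on the wire, so nothing in the URL itself says which encoding produced it.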

The interesting thing about non-ASCII chars in URLs and POSTDATA is that the browsers don't seem to send any indication of the charset used. Whether the content is UTF-8 or ISO-8859-1, all I get is "Content-Type: application/x-www-form-urlencoded". It would be nice to have "; charset=UTF-8" at the end, but it doesn't seem like I'm that lucky!
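For what it's worth, checking for a declared charset is easy enough; the problem is only that there's nothing there to find. A rough Python sketch, where declared_charset is a made-up helper name and the standard library's email.message.Message does the header parsing:

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


# What the browsers actually send: no charset at all.
print(declared_charset("application/x-www-form-urlencoded"))

# What would be nice to receive.
print(declared_charset("application/x-www-form-urlencoded; charset=UTF-8"))
```

The first call gives None and the second gives "utf-8", so a server can only use this as a hint when a client happens to supply it.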

As a result of this, I've reduced the scope - PeopleAggregator will support UTF-8 and ISO-8859-1, with UTF-8 strongly preferred.

For Frontier's benefit, it will handle XML-RPC requests that pretend to be UTF-8 but are actually ISO-8859-1.
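One common way to get "UTF-8 preferred, ISO-8859-1 tolerated" is to try a strict UTF-8 decode first and fall back to ISO-8859-1 if that fails. Here's a minimal Python sketch of that idea; it isn't the actual PeopleAggregator code, and decode_param is a hypothetical helper name.

```python
from urllib.parse import unquote_to_bytes


def decode_param(raw: str) -> str:
    """Decode a percent-encoded value, preferring UTF-8 over ISO-8859-1."""
    data = unquote_to_bytes(raw)
    try:
        # Strict decode: raises if the bytes are not well-formed UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value maps to a character in ISO-8859-1, so this never fails.
        return data.decode("iso-8859-1")


# The same word arrives intact whichever encoding the client used.
from_utf8 = decode_param("I%C3%B1t%C3%ABrn")
from_latin1 = decode_param("I%F1t%EBrn")
print(from_utf8 == from_latin1 == "Iñtërn")
```

The catch is that some ISO-8859-1 text also happens to be valid UTF-8 (the two characters "Ã±" have the same bytes as a UTF-8 "ñ"), so the guess can occasionally be wrong. Real ISO-8859-1 text rarely looks like that, which is why the fallback tends to work well enough in practice.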
