E-mail experiment
I haven't had a lot of time to reply to e-mail recently, and a few messages about my web projects have been waiting in my drafts folder for some time. As a bit of an experiment, I'm going to take a leaf from McCusker's book and reply to the fairly public-sounding messages out here on this weblog.
If I've quoted you here and you'd like me to paraphrase you rather than cutting and pasting verbatim, drop me a line and I'll sort it out. I'll do my best to only publish things I don't expect people to mind me publishing.
Today we have a message from Michael Fagan:
I was thinking that perhaps most ITE channels should be multilingual, with checkboxes for what language posts you want to view. That way there wouldn't need to be several channels for the same topic.
Of course, then either the user would have to input what language their post was in, or it would have to be detected. Automatic detection would be better, as I don't think TrackBack pings could include more information.
That's an interesting idea. I'm surprised how many non-English channels have appeared. The French and Spanish bloggers are out in force!
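I wonder what the best way to implement it would be. As you say, TrackBack pings don't include a lot of information. I wonder if peeking at the character set of the HTTP request would help, although a character set like ISO-8859-1 is shared by English, French and Spanish, so at best it would narrow things down.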
On the other hand, I bet there has been a bit of research into identifying the language of bits of text. A simple (and probably quite effective) method would be to count how many words in the text appear in an English dictionary, then in a French dictionary, then a Spanish one, and so on, and pick the language with the most matches. That wouldn't be hard to code, either.
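Here's a rough sketch of that word-counting idea in Python. The tiny word lists are made up for illustration and stand in for real dictionaries, and the names WORDLISTS and guess_language are my own inventions, so treat this as a toy under those assumptions rather than anything ready to drop into a channel.

```python
import re

# Assumption: tiny hand-picked word lists standing in for full dictionaries.
WORDLISTS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it", "with", "for"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une", "que", "dans"},
    "spanish": {"el", "la", "los", "y", "es", "de", "un", "una", "que", "en"},
}


def guess_language(text):
    """Return the language whose word list matches the most words, or None."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Count, per language, how many words of the text appear in its list.
    scores = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in WORDLISTS.items()
    }
    # On a tie, max() keeps whichever language comes first in WORDLISTS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Le chat est dans le jardin et la maison"))
```

Short posts and words shared between languages ("la", "de", "un") will trip this up, so a real version would want much bigger dictionaries and probably some minimum score before committing to an answer.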
Update: Google leads me to TextCat, a GPL tool that can recognise text in sixty-odd languages. Passing a few sentences in English, French and Spanish through the online demo yielded good results for me. So I guess that answers Michael's question. Now, to find the time to hack it in there ... :-)