Multilingual spell checking with language detection

I'm working on spell checking mixed-language webpages, and haven't been able to find any existing research on the subject.

The aim is to detect the language of each sentence within a mixed-language webpage and spell check that sentence against the appropriate dictionary, all automatically. Assume that we can ignore sentences which themselves mix multiple languages (e.g. "He has a certain je ne sais quoi"), and assume a webpage won't contain more than 2 or 3 languages.

Trivial example (Welsh + English): http://wales.gov.uk/

I'm currently using a mix of:

  • Character distribution (e.g. U+0600 to U+06FF = Arabic, etc.)
  • n-grams to tell apart languages that share similar characters
  • Dictionary lookup to determine locale, e.g. en-US vs. en-GB
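
To make the first two steps concrete, here is a minimal sketch in Python rather than the actual code. The names `SCRIPT_RANGES`, `dominant_script`, `trigrams`, `build_profiles` and `guess_language` are hypothetical, the script table is deliberately tiny, and the scoring is a naive trigram frequency overlap; real profiles would need to be trained on much larger samples than the toy strings shown in the usage lines.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete table of Unicode ranges per script.
SCRIPT_RANGES = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0041, 0x024F, "Latin"),
]

def dominant_script(text):
    """Return the script covering the most letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        cp = ord(ch)
        for start, end, label in SCRIPT_RANGES:
            if start <= cp <= end:
                counts[label] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

def trigrams(text):
    """Count character trigrams over lowercased, space-joined words."""
    padded = " " + " ".join(re.findall(r"\w+", text.lower())) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def build_profiles(samples):
    """Map each language code to the trigram counts of its sample text."""
    return {lang: trigrams(text) for lang, text in samples.items()}

def guess_language(sentence, profiles):
    """Pick the language whose trigram profile overlaps the sentence most."""
    grams = trigrams(sentence)

    def score(profile):
        total = sum(profile.values())
        # Weight each sentence trigram by its relative frequency in the profile.
        return sum(n * profile[g] / total for g, n in grams.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))

# Toy training samples (assumptions, far too small for real use).
profiles = build_profiles({
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "the language of the people",
    "cy": "mae hen wlad fy nhadau yn annwyl i mi gwlad beirdd a "
          "chantorion enwogion o fri",
})
print(dominant_script("Hello world"))
print(guess_language("the people of this land", profiles))
print(guess_language("mae gwlad fy nhadau", profiles))
```

The script check runs first because it is cheap and settles cases like Arabic vs. Latin outright; the trigram comparison is only needed when several candidate languages share a script. The locale step (en-US vs. en-GB) is not shown, since it depends on whichever dictionaries are available.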

I have working code, but I'm concerned it may be naive or needlessly reinventing the wheel. Has anyone else done this before?

Asked by Oliver Emberton, 3 May 2011 at 17:54