Using Markov models to convert all-caps text to mixed case, and related problems

I've been thinking about using Markov techniques to restore missing information to natural language text.

  • Restore all-caps text to mixed-case.
  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
  • Convert rough phonetic transcriptions back into native alphabets.

That list seems to run from least difficult to most difficult. In each case the underlying problem is resolving ambiguities based on context.

My plan is to use Wiktionary as a dictionary and Wikipedia as a corpus, and to resolve the ambiguities with n-grams and Hidden Markov Models.
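To make the idea concrete, here is a minimal sketch of what I have in mind for the all-caps case: a bigram model over surface forms, decoded with Viterbi over the candidate casings of each word. This is effectively an HMM with deterministic emissions, not a full HMM with learned emission probabilities. The four-sentence corpus stands in for Wikipedia, and the names `train`, `restore_case`, and the interpolation weight `lam` are my own placeholders, not taken from any library.

```python
import math
from collections import Counter, defaultdict

def train(sentences):
    """Collect candidate casings per lowercased word, plus unigram/bigram counts."""
    candidates = defaultdict(set)
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams["<s>"] += 1
        for prev, cur in zip(tokens, tokens[1:]):
            candidates[cur.lower()].add(cur)
            unigrams[cur] += 1
            bigrams[(prev, cur)] += 1
    return candidates, unigrams, bigrams

def restore_case(text, model, lam=0.7):
    """Pick the most probable casing sequence with Viterbi decoding."""
    candidates, unigrams, bigrams = model
    total = sum(unigrams.values())
    vocab = len(unigrams)

    def log_prob(prev, cur):
        # Interpolate the bigram estimate with an add-one smoothed unigram estimate.
        p_uni = (unigrams[cur] + 1) / (total + vocab)
        p_bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
        return math.log(lam * p_bi + (1 - lam) * p_uni)

    # best maps each candidate form to (score, path) of the best path ending there.
    best = {"<s>": (0.0, [])}
    for word in text.split():
        key = word.lower()
        # Unknown words fall back to their lowercased form.
        options = sorted(candidates.get(key, {key}))
        new_best = {}
        for cur in options:
            score, path = max(
                ((s + log_prob(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return " ".join(max(best.values(), key=lambda item: item[0])[1])

# Toy corpus (an assumption for illustration; a real model would be trained on Wikipedia).
corpus = [
    "George lost his SIM card in the bush",
    "the SIM card is in the phone",
    "George Bush lost the election",
    "a sim is a simulation",
]
model = train(corpus)
print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", model))
print(restore_case("GEORGE BUSH LOST THE ELECTION", model))
```

On this toy corpus the context is enough to choose "SIM" over "sim" and "bush" over "Bush" in the first sentence, and "Bush" in the second. A real system would also need to deal with sentence-initial capitalisation, which skews the counts for the first word of each training sentence.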

Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?

Examples

  • GEORGE LOST HIS SIM CARD IN THE BUSH ⇨ George lost his SIM card in the bush
  • tantot il rit a gorge deployee ⇨ tantôt il rit à gorge déployée
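For the accent example, the ambiguity can be made explicit by indexing a word list under its accent-stripped key; each key with more than one entry is a decision the language model would have to make from context. The sketch below is only an illustration: the helper names `strip_accents` and `build_index` and the tiny word list are hypothetical, standing in for a dictionary extracted from Wiktionary. It relies on Unicode NFD decomposition, so letters without a canonical decomposition (such as "ø") would need a separate mapping.

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(words):
    """Map each accent-stripped key to the set of dictionary forms that share it."""
    index = defaultdict(set)
    for word in words:
        index[strip_accents(word)].add(word)
    return index

# Hypothetical word list; a real one would come from a dictionary dump.
words = ["tantôt", "il", "rit", "a", "à", "gorge", "déployée"]
index = build_index(words)

for token in "tantot il rit a gorge deployee".split():
    print(token, sorted(index.get(token, {token})))
```

Here only "a" is ambiguous (it could be "a" or "à"); the other tokens each have a single candidate, so a dictionary lookup alone would restore them.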
asked by Brock Adams on 6 August 2011 at 16:27