OCR error correction algorithms

I'm digitizing a large collection of scanned documents using Tesseract 3 as my OCR engine. The quality of its output is mediocre: it often produces garbage characters before and after the actual text, as well as misspellings within the text.

For the former problem, it seems like there must be strategies for determining which text is actually text and which isn't (much of this text consists of things like people's names, so I'm looking for solutions other than looking words up in a dictionary). A rough sketch of the kind of heuristic I have in mind is below.
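To make that concrete, here's a minimal sketch of the sort of heuristic garbage filter I've been imagining. The rules and thresholds are just guesses on my part, not tuned values, and the whole thing is a placeholder for whatever established approach exists:

```python
import re

def looks_like_garbage(token: str) -> bool:
    """Heuristic check for OCR garbage tokens (illustrative thresholds only)."""
    if not token:
        return True
    letters = sum(c.isalpha() for c in token)
    digits = sum(c.isdigit() for c in token)
    others = len(token) - letters - digits

    # Mostly punctuation or symbols: likely scanner noise.
    if others > letters:
        return True
    # Letters and digits interleaved in one "word" is rare outside codes and IDs.
    if letters and digits and len(token) > 2:
        return True
    # No vowels in a longish alphabetic token (still allows short names and initials).
    if letters >= 6 and not re.search(r"[aeiouyAEIOUY]", token):
        return True
    # The same character repeated four or more times in a row.
    if re.search(r"(.)\1{3}", token):
        return True
    return False
```

Something like this could run over the token stream to trim the junk at the edges of the recognized text, but I assume the real solutions are more principled than hand-written rules.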

For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), so it seems like there should be methods both for guessing which words are misspelled (since not many English words have a "1" in the middle of them) and for guessing the appropriate correction.
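Here's a sketch of the confusion-set approach I've been considering, assuming I can build a word/name frequency list (`vocabulary_freq`) from cleaner parts of the corpus. The confusion sets below are just examples, not a real error model:

```python
from itertools import product

# Illustrative OCR confusion sets; a real table would come from observed errors.
# Multi-character confusions (e.g. "rn" -> "m") would need separate handling.
CONFUSIONS = {
    "1": "1lI", "l": "l1I", "I": "I1l",
    "0": "0O", "O": "O0",
    "5": "5S", "S": "S5",
}

def candidate_spellings(word, confusions=CONFUSIONS, limit=500):
    """Yield plausible re-spellings of an OCR'd word by swapping confusable characters."""
    options = [confusions.get(c, c) for c in word]
    for count, chars in enumerate(product(*options)):
        if count >= limit:  # cap the combinatorial explosion
            break
        yield "".join(chars)

def correct(word, vocabulary_freq):
    """Pick the most frequent in-vocabulary candidate; fall back to the original word."""
    candidates = [w for w in candidate_spellings(word) if w in vocabulary_freq]
    if not candidates:
        return word
    return max(candidates, key=vocabulary_freq.get)
```

The idea would be to run `correct` only on tokens that look suspicious (e.g. a digit in the middle of a word), rather than on every word, but I don't know whether this is how the standard tools actually do it.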

What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Googling has turned up lots of papers, but not much that's concrete. If there aren't implementations available, which of the many papers would be a good starting place?
