Improving Book OCR by Adaptive Language and Image Models

To cope with the vast diversity of book content and typefaces, OCR systems need to exploit the strong consistency within a book while adapting to variations across books. We describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to the shapes and vocabulary within a book and flags inconsistencies as correction hypotheses, but relies on the other model for effective cross-validation. Using the open source Tesseract engine as the baseline, results on a large data set of scanned books show that this approach can reduce word error rates by 25 percent.
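The cross-validation idea can be illustrated with a toy sketch. This is not the authors' implementation: the `Token` type, the integer `shape` cluster id, the `min_count` threshold, and the one-edit candidate rule are all assumptions made for illustration. It shows only one direction of the interplay, where a book-specific vocabulary proposes corrections for rare words and a stand-in for the image model accepts a proposal only if the word image belongs to a shape cluster whose majority label agrees.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str   # OCR output for one word
    shape: int  # hypothetical id of the word-image cluster (assumption)


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, longer = sorted((a, b), key=len)
    return any(longer[:i] + longer[i + 1:] == short for i in range(len(longer)))


def correct(tokens, min_count=3):
    """Return corrected word strings for a list of Tokens from one book."""
    # Language side: vocabulary counts adapted to this book only.
    counts = Counter(t.text for t in tokens)
    # Image side: which texts were assigned to each shape cluster.
    by_shape = defaultdict(Counter)
    for t in tokens:
        by_shape[t.shape][t.text] += 1

    corrected = []
    for t in tokens:
        fixed = t.text
        if counts[t.text] < min_count:
            # Language path: frequent in-book words one edit away are hypotheses.
            candidates = {
                w for w, c in counts.items()
                if c >= min_count and within_one_edit(w, t.text)
            }
            # Image path: accept only if the shape cluster's majority label agrees.
            majority = by_shape[t.shape].most_common(1)[0][0]
            if majority in candidates:
                fixed = majority
        corrected.append(fixed)
    return corrected


tokens = [
    Token("the", 1), Token("the", 1), Token("the", 1),
    Token("tbe", 1),  # rare spelling, same shape cluster as "the"
    Token("tie", 2),  # rare but in a different shape cluster
]
result = correct(tokens)
```

In this toy input, "tbe" is rewritten to "the" because both the vocabulary and the shape cluster support the change, while "tie" is left alone: the vocabulary alone would suggest "the", but the shape evidence does not back it up.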
