Adapting the Tesseract open source OCR engine for multilingual OCR

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has concentrated on enabling generic multilingual operation, such that negligible customization is required for a new language beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
