Robust Recognition of Degraded Documents Using Character N-Grams

In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCR systems perform well on good-quality documents but fail easily in the presence of degradations. Classical OCR approaches also perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character or component segments. Our approach is distinctive in that it uses character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusable characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition accuracy on heavily degraded documents.
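
The two steps named above, grouping consecutive segments into character n-grams and fusing the n-gram labels into a word label, can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the per-position majority vote below is an assumption made for illustration only, and the function names `character_ngrams` and `fuse_ngram_labels` are hypothetical. Plain characters stand in for segment images, and the recognizer output is simulated by hand.

```python
from collections import Counter


def character_ngrams(segments, n):
    """Return every run of n consecutive segments as (start_index, run)."""
    return [(i, tuple(segments[i:i + n])) for i in range(len(segments) - n + 1)]


def fuse_ngram_labels(ngram_labels, word_length):
    """Fuse overlapping n-gram labels by majority vote at each position.

    ngram_labels is a list of (start_index, label_string) pairs.
    NOTE: majority voting is an assumed fusion rule, not the paper's.
    """
    votes = [Counter() for _ in range(word_length)]
    for start, label in ngram_labels:
        for offset, ch in enumerate(label):
            pos = start + offset
            if 0 <= pos < word_length:
                votes[pos][ch] += 1
    # Positions that received no vote are marked with "?".
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)


# Characters stand in for the segment images of one word.
segments = list("robust")
trigrams = character_ngrams(segments, 3)

# Simulated recognizer output: the second trigram misreads "b" as "h".
recognized = [(0, "rob"), (1, "ohu"), (2, "bus"), (3, "ust")]
word = fuse_ngram_labels(recognized, len(segments))
```

Because each interior character is covered by several overlapping n-grams, a single misrecognized n-gram can be outvoted by its neighbours, which is one way to read the abstract's claim that the added context helps disambiguate confusable characters.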
