Paper information - Efficient automatic OCR word validation using word partial format derivation and language model

Efficient automatic OCR word validation using word partial format derivation and language model

In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. The module detects and corrects suspicious words in the OCR output of scanned textual documents in three steps: it derives partial formats for each suspicious word, retrieves candidate words from lexicons by partial-match search, and compares the joint probabilities of the N-gram and the OCR edit transformation corresponding to each candidate. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case, a collection of historic medico-legal documents, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.
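The candidate-retrieval step can be illustrated with a minimal Python sketch of a ternary search tree that supports partial-match search. This is not the SPER implementation: the abstract does not say how partial formats are encoded, so the `?` wildcard, the class and method names, and the toy lexicon are all assumptions made for illustration. A partial format here is simply a fixed-length pattern in which characters judged unreliable are replaced by the wildcard.

```python
from typing import List, Optional

WILDCARD = "?"  # assumed marker for an unreliable character in a partial format


class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["Node"] = None  # subtree of smaller characters
        self.eq: Optional["Node"] = None  # subtree for the next character
        self.hi: Optional["Node"] = None  # subtree of larger characters
        self.is_word = False


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, word: str) -> None:
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node: Optional[Node], word: str, i: int) -> Node:
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def partial_match(self, pattern: str) -> List[str]:
        """Return all stored words of len(pattern) that agree with the
        pattern at every non-wildcard position."""
        out: List[str] = []
        if pattern:
            self._match(self.root, pattern, 0, [], out)
        return sorted(out)

    def _match(self, node, pattern, i, prefix, out) -> None:
        if node is None:
            return
        ch = pattern[i]
        # A wildcard explores all three branches; a literal explores one.
        if ch == WILDCARD or ch < node.ch:
            self._match(node.lo, pattern, i, prefix, out)
        if ch == WILDCARD or ch == node.ch:
            prefix.append(node.ch)
            if i + 1 == len(pattern):
                if node.is_word:
                    out.append("".join(prefix))
            else:
                self._match(node.eq, pattern, i + 1, prefix, out)
            prefix.pop()
        if ch == WILDCARD or ch > node.ch:
            self._match(node.hi, pattern, i, prefix, out)


# Toy lexicon (hypothetical); "med?cal" stands in for a suspicious OCR
# token such as "medlcal" whose fourth character is treated as unknown.
lexicon = TernarySearchTree()
for w in ("medical", "medial", "legal", "renal"):
    lexicon.insert(w)
print(lexicon.partial_match("med?cal"))  # ['medical']
```

In a full pipeline, the candidates returned by `partial_match` would then be ranked by the joint N-gram and OCR edit probabilities described in the abstract; that scoring step is omitted here because the abstract does not give its parameters.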
