Strategies for Reducing and Correcting OCR Errors

In this paper we describe our efforts to reduce and correct OCR errors in the context of building a large multilingual heritage corpus of Alpine texts, which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995, comprising more than 75,000 pages and 29 million running words. Since these books have been published continuously, they represent a unique basis for historical, cultural and linguistic research. We used commercial OCR systems to convert the scanned images to searchable text. This poses several challenges. For example, the built-in lexicons of the OCR systems do not cover 19th century German spelling, the Swiss German spelling variants, or the plethora of toponyms that are characteristic of our text genre. We also realized that different OCR systems make different recognition errors. We therefore run two OCR systems over all our scanned pages and merge the output. Merging is especially tricky at spots where both systems produce partially correct word groups. We describe our strategies for reducing OCR errors by enlarging the systems' lexicons and by two post-correction methods, namely merging the output of two OCR systems and auto-correction based on additional lexical resources.
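
The idea of merging the output of two OCR systems can be illustrated with a minimal sketch. It is not the procedure used in this work: it assumes both outputs are already tokenized, aligns them with Python's standard `difflib.SequenceMatcher`, and at each spot where the systems disagree keeps the candidate with more tokens found in a lexicon. The function name `merge_ocr_outputs`, the sample sentences and the tiny lexicon are all hypothetical.

```python
from difflib import SequenceMatcher


def merge_ocr_outputs(tokens_a, tokens_b, lexicon):
    """Merge two token lists from different OCR systems (illustrative only).

    Where the systems agree, the shared tokens are kept. Where they differ,
    the candidate span with more lexicon hits is chosen; ties go to system A.
    """
    merged = []
    # autojunk=False disables the popularity heuristic so frequent tokens
    # are still used as alignment anchors on long pages.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(tokens_a[a1:a2])
            continue
        cand_a = tokens_a[a1:a2]
        cand_b = tokens_b[b1:b2]
        # Assumption: lexicon membership is a usable proxy for correctness.
        hits_a = sum(tok.lower() in lexicon for tok in cand_a)
        hits_b = sum(tok.lower() in lexicon for tok in cand_b)
        merged.extend(cand_b if hits_b > hits_a else cand_a)
    return merged


# Hypothetical example: each system misreads a different word.
lexicon = {"die", "besteigung", "des", "matterhorns", "war", "schwierig"}
system_a = "Die Besteigung des Matterhoms war schwierig".split()
system_b = "Die Bcsteigung des Matterhorns war schwierig".split()
print(" ".join(merge_ocr_outputs(system_a, system_b, lexicon)))
```

Counting lexicon hits per span is a deliberate simplification. It does not resolve the tricky case of word groups that are only partially correct in both systems, and it depends on a lexicon that covers historical spellings and toponyms.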
