Digitised historical text: Does it have to be mediOCRe?

This paper reports on experiments to improve the optical character recognition (OCR) quality of historical text as a preliminary step in text mining. We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps. We also demonstrate the impact this can have on named entity recognition in a preliminary extrinsic evaluation. This work was performed as part of the Trading Consequences project, which focuses on text mining of historical documents for the study of nineteenth-century trade in the British Empire.
