On OCR ground truths and OCR post-correction gold standards, tools and formats

We give an overview of activities undertaken over the past few years on the sidelines of our core business, automatic OCR post-correction. We present ongoing projects in the Netherlands in which Text-Induced Corpus Clean-up plays a part. We describe the infrastructure we are building to help improve the overall text quality of large digitized text collections. We provide information on the tools we are developing to facilitate this process and discuss the role of FoLiA XML, which we adopted as a pivot format. Connecting the dots, we discuss the difference we perceive between OCR ground truths and OCR post-correction gold standards, and what each contributes.
