OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited

We present further work on evaluating the fully automatic post-correction of Early Dutch Books Online, a collection of 10,333 18th-century books. In prior work we evaluated the new implementation of Text-Induced Corpus Clean-up (TICCL) against a single-book gold standard derived from this collection. In the current paper we revisit the same collection using a sizeable random sample of 1,020 OCR post-corrected strings drawn from the full collection. Both evaluations have their own stories to tell and lessons to teach.
