OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited

We present further work on the evaluation of the fully automatic post-correction of Early Dutch Books Online, a collection of 10,333 18th-century books. In prior work we evaluated the new implementation of Text-Induced Corpus Clean-up (TICCL) against a single-book Gold Standard derived from this collection. In the current paper we revisit the same collection using a sizeable random sample of 1,020 OCR post-corrected strings drawn from the full collection. Both evaluations have their own stories to tell and lessons to teach.

Martin Reynaert
