Supervised OCR Error Detection and Correction Using Statistical and Neural Machine Translation Methods

Optical character recognition (OCR) errors are a serious obstacle to indexing the content of digitized historical texts. To explore the effectiveness of new strategies for OCR post-correction, this article focuses on character-based machine translation methods, specifically neural machine translation (NMT) and statistical machine translation (SMT). Using the ICDAR 2017 data set on OCR post-correction for English and French, we experiment with different strategies for error detection and error correction. We analyze how OCR post-correction with NMT can profit from additional information and show that SMT and NMT can benefit from each other in these tasks. An ensemble of our models achieved the best performance in the error correction subtask of the ICDAR 2017 competition and performed competitively in error detection. However, our experimental results also suggest that tuning supervised learning for OCR post-correction of texts from different sources, text types (periodicals and monographs), time periods and languages is a difficult task: the data on which the MT systems are trained strongly influence which methods and features work best. Conclusive and generally applicable insights are therefore hard to obtain.
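To make the idea of character-based translation concrete, the following is a minimal sketch and not the article's actual pipeline. It assumes a common preprocessing convention in which each character becomes one token, so that a standard MT toolkit can treat the OCR output as the source sentence and the corrected text as the target sentence. The placeholder symbol and the function names `to_char_tokens`, `from_char_tokens` and `changed_spans` are hypothetical, and the `difflib`-based comparison only illustrates how error positions could be derived from an aligned pair.

```python
from difflib import SequenceMatcher

# Hypothetical placeholder for a literal space; an assumption, not from the article.
SPACE = "\u2581"


def to_char_tokens(text: str) -> str:
    """Turn a string into space-separated character tokens."""
    return " ".join(SPACE if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the tokens and restore literal spaces."""
    return "".join(" " if tok == SPACE else tok for tok in tokens.split(" "))


def changed_spans(ocr: str, corrected: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets in `ocr` that differ from `corrected`."""
    matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes() if tag != "equal"]


if __name__ == "__main__":
    ocr_line = "tbe cat"        # made-up OCR output
    gold_line = "the cat"       # made-up corrected text
    print(to_char_tokens(ocr_line))    # source side of a training pair
    print(to_char_tokens(gold_line))   # target side of a training pair
    print(changed_spans(ocr_line, gold_line))
```

In such a setup, error detection amounts to flagging the spans where a system's output differs from its input, while error correction uses the output itself. Whether this matches the article's exact formulation is not stated in the abstract.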
