Neural OCR Post-Hoc Correction of Historical Corpora

Abstract: Optical character recognition (OCR) is crucial for deeper access to historical collections. OCR needs to account for orthographic variation, typefaces, and language evolution (i.e., new letters and word spellings), which are the main sources of character, word, and word segmentation transcription errors. For digital corpora of historical prints, these errors are further exacerbated by low scan quality and a lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach that combines recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors. Operating at the character level, the model flexibly captures errors and decodes the corrected output using a novel attention mechanism. To account for the similarity between input and output, we propose a new loss function that rewards the model's correcting behavior. Evaluation on a historical book corpus in German shows that our models robustly capture diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
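To make the reported figures concrete: a reduction of more than 89% from a word error rate (WER) of 32.3% implies a residual WER below roughly 3.6% (0.323 × 0.11 ≈ 0.0355). The sketch below shows one common way to compute WER, as word-level edit distance divided by the reference length, together with the relative reduction. It is only an illustration: the abstract does not specify the exact evaluation procedure, and the function names and the German example strings are assumptions made here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization; the paper's tokenization may differ.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Hypothetical example: one wrong word out of four.
wer = word_error_rate("der alte Mann geht", "der alle Mann geht")

# Residual WER implied by an 89% relative reduction from 32.3%.
residual_upper_bound = 0.323 * (1 - 0.89)
```

Under this definition, the 89% figure is a relative reduction: it is measured against the initial 32.3% rather than in absolute percentage points.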
