Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

In this paper, we propose a novel method to extend sequence-to-sequence models to accurately process sequences much longer than the ones used during training while being sample- and resource-efficient, supported by thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems, using character-based sequence-to-sequence models. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve new state-of-the-art performance in five of them. The best-performing strategy splits the input document into character n-grams and combines their individual corrections into the final output using a voting scheme, which is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each member of this ensemble. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
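As an illustration of the voting strategy described above, the sketch below splits a document into overlapping character n-grams, corrects each window independently, and merges the outputs with per-position majority voting. The `correct_window` callable is a hypothetical stand-in for a trained character sequence-to-sequence model, and the one-to-one alignment between each window and its correction is a simplifying assumption; the weighting of ensemble members investigated in the paper is not reproduced here.

```python
from collections import Counter
from typing import Callable, List


def correct_document(
    text: str,
    correct_window: Callable[[str], str],  # hypothetical stand-in for a trained seq2seq model
    n: int = 50,                           # assumed n-gram (window) length
) -> str:
    """Correct overlapping character n-grams independently and combine them
    by per-position majority voting (illustrative sketch only)."""
    # One ballot box of candidate characters per position of the document.
    votes: List[Counter] = [Counter() for _ in range(len(text))]
    for start in range(max(len(text) - n + 1, 1)):
        window = text[start:start + n]
        corrected = correct_window(window)
        # Simplifying assumption: the model's output is aligned one-to-one
        # with its input window (real corrections may change the length).
        for offset, char in enumerate(corrected[:len(window)]):
            votes[start + offset][char] += 1
    # Keep the majority character at each position; fall back to the original
    # character if a position received no votes.
    return "".join(
        votes[i].most_common(1)[0][0] if votes[i] else text[i]
        for i in range(len(text))
    )


if __name__ == "__main__":
    # Toy "model" that fixes a single common OCR confusion, for demonstration only.
    def toy_model(window: str) -> str:
        return window.replace("0", "o")

    print(correct_document("c0rrecti0n of p0st-0CR err0rs", toy_model, n=5))
```

Because each character receives a vote from every window that covers it, the procedure behaves like an ensemble whose size grows with the amount of window overlap, which is the sense in which the voting scheme corresponds to a large ensemble of sequence models.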
