Normalizing historical orthography for OCR historical documents using LSTM

Historical text presents numerous challenges for contemporary different techniques, e.g. information retrieval, OCR and POS tagging. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system which requires reference to a fixed lexicon accessed by orthographic form. For example, language modeling or retrieval engine for historical text which is produced by OCR systems, where the spelling of words often differ in various way, e.g. one word might have different spellings evolved over time. It is very important to aid those techniques with the rules for automatic mapping of historical wordforms. In this paper, we propose a new technique to model the target modern language by means of a recurrent neural network with long-short term memory architecture. Because the network is recurrent, the considered context is not limited to a fixed size especially due to memory cells which are designed to deal with long-term dependencies. In the set of experiments conducted on the Luther bible database and transform wordforms from Early New High German (ENHG) 14th - 16th centuries to the corresponding modern wordforms in New High German (NHG). We compare our proposed supervised model LSTM to various methods for computing word alignments using statistical, heuristic models. Our new proposed LSTM outperforms the other three state-of-the-art methods. The evaluation shows the accuracy of our model on the known wordforms is 93.90% and on the unknown wordforms is 87.95%, while the accuracy of the existing state-of-the-art combined approach of the wordlist-based and rule-based normalization models is 92.93% for known and 76.88% for unknown tokens. Our proposed LSTM model outperforms on normalizing the modern wordform to historical wordform. The performance on seen tokens is 93.4%, while for unknown tokens is 89.17%.

[1]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[2]  T. Munich,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[3]  Rafael Llobet,et al.  OCR Post-processing Using Weighted Finite-State Transducers , 2010, 2010 20th International Conference on Pattern Recognition.

[4]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[5]  Mehryar Mohri Edit-distance of weighted automata , 2002, CIAA'02.

[6]  Paul Rayson,et al.  VARD2 : a tool for dealing with spelling variation in historical corpora , 2008 .

[7]  Iris Hendrickx,et al.  From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation , 2011, J. Lang. Technol. Comput. Linguistics.

[8]  Klaus U. Schulz,et al.  Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations , 2007 .

[9]  Bryan Jurish,et al.  More than Words: Using Token Context to Improve Canonicalization of Historical German , 2010, J. Lang. Technol. Comput. Linguistics.

[10]  Stefanie Dipper,et al.  Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation , 2011, LTC.

[11]  Jürgen Schmidhuber,et al.  LSTM recurrent networks learn simple context-free and context-sensitive languages , 2001, IEEE Trans. Neural Networks.

[12]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[13]  Stefanie Dipper,et al.  Rule-Based Normalization of Historical Texts , 2011 .

[14]  Norbert Fuhr,et al.  Generating Search Term Variants for Text Collections with Historic Spellings , 2006, ECIR.

[15]  Fabienne Braune,et al.  Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora , 2010, COLING.

[16]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[17]  Lukás Burget,et al.  Empirical Evaluation and Combination of Advanced Language Modeling Techniques , 2011, INTERSPEECH.

[18]  Justin Zobel,et al.  Finding approximate matches in large lexicons , 1995, Softw. Pract. Exp..

[19]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.