Probabilistic Indexing and Search for Information Extraction on Handwritten German Parish Records

We aim to index, at very large scale, a historical German collection of handwritten parish records. To this end we compute "probabilistic indexes" (PIs), which are known to support very accurate and efficient (single-)keyword spotting. Because PIs can become prohibitively large for vast manuscript collections, we analyze simple index pruning methods that achieve adequate tradeoffs between memory requirements and search performance. We also study how to deal with the wide variety of non-ASCII symbols and handwritten spelling variations (accents, umlauts, etc.) that appear in this kind of historical collection. Finally, and most importantly, since most of the images in the collection are handwritten tables, we explore the use of PIs to support structured queries for information extraction from untranscribed handwritten images containing tabular data. Empirical results on a small but complex and representative dataset extracted from the collection confirm the viability and adequacy of the chosen approaches.
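The following is a minimal sketch of three ideas mentioned in the abstract: pruning low-confidence index entries, folding diacritics so that spelling variants share a key, and a simple multi-word (AND) query. It is not the authors' implementation. The entry layout `(word, page, probability)`, the function names, the sample values, and the choice of `max` for merging variants and `min` for combining AND terms are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def fold(word):
    """Lowercase and strip combining marks, e.g. 'Müller' -> 'muller'.

    Note: characters without a Unicode decomposition (such as 'ß')
    are left unchanged by this simple folding.
    """
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prune(entries, threshold):
    """Drop entries whose relevance probability is below the threshold."""
    return [entry for entry in entries if entry[2] >= threshold]


def build_index(entries):
    """Map folded word -> {page: probability}.

    When several spelling variants fold to the same key on the same page,
    the highest probability is kept (an assumption, not the paper's rule).
    """
    index = defaultdict(dict)
    for word, page, prob in entries:
        key = fold(word)
        index[key][page] = max(prob, index[key].get(page, 0.0))
    return index


def search(index, query, min_conf=0.0):
    """Return (page, probability) hits for one keyword, best first."""
    hits = index.get(fold(query), {})
    found = [(page, prob) for page, prob in hits.items() if prob >= min_conf]
    return sorted(found, key=lambda hit: -hit[1])


def and_query(index, words):
    """Pages containing all words, scored by the minimum term probability.

    Using min() for AND is one simple approximation, assumed here.
    """
    per_word = [index.get(fold(word), {}) for word in words]
    if not per_word:
        return []
    common = set(per_word[0]).intersection(*per_word[1:])
    scored = [(page, min(hits[page] for hits in per_word)) for page in common]
    return sorted(scored, key=lambda hit: -hit[1])


# Hypothetical sample entries: (word, page, relevance probability).
entries = [
    ("Müller", "p1", 0.92),
    ("Muller", "p1", 0.40),
    ("Müller", "p2", 0.03),
    ("Schmidt", "p2", 0.75),
    ("Schmidt", "p1", 0.60),
]
pruned = prune(entries, threshold=0.05)
index = build_index(pruned)
```

With these sample values, raising the pruning threshold shrinks the index at the cost of losing low-confidence hits, which is the memory versus search-performance tradeoff the abstract refers to.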
