Information extraction from historical handwritten document images with a context-aware neural model

Abstract: Many historical manuscripts that hold trustworthy memories of past societies contain information organized in a structured layout (e.g. census, birth or marriage records). The precious information stored in these documents cannot be effectively used or accessed without costly annotation efforts. Transcription driven by the semantic categories of words is crucial for subsequent access. In this paper we describe an approach to extract information from structured historical handwritten text images and to build a knowledge representation for extracting meaning from historical data. The method extracts information, such as named entities, without the need for an intermediate transcription step, thanks to the incorporation of context information through language models. Our system has two variants: the first is based on bigrams, whereas the second is based on recurrent neural networks. Concretely, our second architecture integrates a Convolutional Neural Network to model visual information from word images together with a Bidirectional Long Short-Term Memory network to model the relations among the words. This integrated sequential approach is able to extract more information than just the semantic category (e.g. a semantic category can be associated with a person in a record). Our system is generic: it deals with out-of-vocabulary words by design, and it can be applied to structured handwritten texts from different domains. The method has been validated with the ICDAR IEHHR competition protocol, outperforming the existing approaches.
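
To make the role of bigram context concrete, the following is a minimal sketch of how a bigram model over semantic categories could re-rank per-word category scores. It is not the authors' implementation: the abstract does not specify the decoding procedure, so Viterbi decoding is an assumption here, and the function name `viterbi_bigram`, the category labels, and all probabilities are hypothetical values chosen for illustration.

```python
import math

def viterbi_bigram(emissions, bigram, labels, start="<s>"):
    """Return the most likely category sequence for one record.

    emissions: list of dicts, one per word image, mapping category -> score
               (assumed to come from some visual word classifier).
    bigram:    dict mapping (previous_category, category) -> probability.
    """
    def logp(p):
        # Floor probabilities so that unseen events do not produce log(0).
        return math.log(max(p, 1e-12))

    # Best log-score of any path ending in each category at the first word.
    best = {c: logp(bigram.get((start, c), 0.0)) + logp(emissions[0].get(c, 0.0))
            for c in labels}
    backpointers = []

    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            # Pick the predecessor that maximizes path score plus transition.
            prev = max(labels, key=lambda p: best[p] + logp(bigram.get((p, cur), 0.0)))
            new_best[cur] = (best[prev]
                             + logp(bigram.get((prev, cur), 0.0))
                             + logp(em.get(cur, 0.0)))
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final category.
    path = [max(labels, key=lambda c: best[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical categories and scores, for illustration only.
labels = ["name", "surname", "occupation"]
emissions = [
    {"name": 0.6, "surname": 0.3, "occupation": 0.1},
    {"name": 0.5, "surname": 0.4, "occupation": 0.1},  # visually ambiguous word
    {"name": 0.1, "surname": 0.1, "occupation": 0.8},
]
bigram = {
    ("<s>", "name"): 0.8, ("<s>", "surname"): 0.1, ("<s>", "occupation"): 0.1,
    ("name", "name"): 0.1, ("name", "surname"): 0.8, ("name", "occupation"): 0.1,
    ("surname", "name"): 0.2, ("surname", "surname"): 0.1, ("surname", "occupation"): 0.7,
    ("occupation", "name"): 0.5, ("occupation", "surname"): 0.2, ("occupation", "occupation"): 0.3,
}

# Labelling each word on its own, without any context.
independent = [max(labels, key=lambda c: em[c]) for em in emissions]
# Labelling the whole sequence with bigram context.
decoded = viterbi_bigram(emissions, bigram, labels)

print(independent)
print(decoded)
```

In this toy record the second word is labelled "name" when each word is classified in isolation, but the bigram context shifts it to "surname" because a surname is far more likely to follow a name. The sketch works on category scores only and never produces a transcription, which mirrors the idea of skipping the intermediate transcription step.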
