Bench-Marking Information Extraction in Semi-Structured Historical Handwritten Records

In this report, we present findings from benchmarking experiments on information extraction from Esposalles, the historical handwritten marriage records used in the IEHHR ICDAR 2017 robust reading competition. Information extraction is modeled as semantic labeling of the word sequence across two sets of labels. This can be achieved by applying handwritten text recognition (HTR) and named entity recognition (NER) either sequentially or jointly. We adopt a pipeline approach: a state-of-the-art HTR system is run first, and its output serves as the input to NER. We show that, given the low-resource setup and the simple structure of the records, high HTR performance ensures high overall performance. We explore various configurations of conditional random fields and neural networks to benchmark NER on the resulting noisy input. The best model, on 10-fold cross-validation as well as on the blind test data, uses n-gram features with a bidirectional long short-term memory network.
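As a rough illustration of the pipeline's second stage, the sketch below builds n-gram features for each token of an HTR transcription and pairs each token with one label from each of the two label sets. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say whether the n-grams are character-level or word-level (character-level is assumed here), and the function names, the sample tokens, and the label values are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def char_ngrams(token: str, n_values: Sequence[int] = (1, 2, 3)) -> List[str]:
    """Return character n-grams of a token, with '<' and '>' as boundary markers."""
    padded = f"<{token.lower()}>"
    grams: List[str] = []
    for n in n_values:
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def token_features(tokens: List[str], i: int) -> Dict[str, float]:
    """Count character n-grams of token i and of its immediate neighbours.

    Each feature key is prefixed with the relative position (-1, 0, +1) so that
    a sequence model can distinguish context from the token itself.
    """
    feats: Counter = Counter()
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for gram in char_ngrams(tokens[j]):
                feats[f"{offset}:{gram}"] += 1.0
    return dict(feats)


# Hypothetical (possibly noisy) HTR output for part of one record.
tokens = ["Joan", "Pere", "pages"]
features = [token_features(tokens, i) for i in range(len(tokens))]

# Assumed labeling scheme: every token carries one label from each of two sets.
labels: List[Tuple[str, str]] = [
    ("name", "husband"),
    ("surname", "husband"),
    ("occupation", "husband"),
]
```

Feature dictionaries of this kind could be fed to a conditional random field directly, or mapped to indices and embedded as the input to a bidirectional LSTM; sub-word features like these tend to tolerate the spelling errors that HTR introduces better than whole-word features do.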
