A comparison of sequential and combined approaches for named entity recognition in a corpus of handwritten medieval charters

This paper introduces a new corpus of multilingual medieval handwritten charter images, annotated with full transcriptions and named entities. The corpus is used to compare two approaches to named entity recognition in historical document images in several languages: a sequential approach, the more commonly used of the two, which applies handwritten text recognition (HTR) followed by named entity recognition (NER), and a combined approach, which transcribes the text line image and extracts the entities simultaneously. Experiments conducted on the charter corpus in Latin, Early New High German and Old Czech for name, date and location recognition show that the combined approach performs better.
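To make the combined approach more concrete, the following minimal sketch shows one way a single output sequence could carry both the transcription and the entities, and how the two might be separated afterwards. The inline tag format (`<PER>`, `<LOC>`, `<DAT>`), the function name `split_tagged_line` and the sample line are all illustrative assumptions; the abstract does not specify how entities are encoded in the model output.

```python
import re

# Hypothetical inline tag format: <PER>, <LOC> and <DAT> are assumed labels
# for names, locations and dates; the paper's actual encoding may differ.
TAG_PATTERN = re.compile(r"<(PER|LOC|DAT)>(.*?)</\1>")


def split_tagged_line(tagged_line):
    """Split a tagged text line into its plain transcription and its entities.

    Returns a tuple (plain_text, entities), where entities is a list of
    (label, text) pairs in reading order.
    """
    entities = [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(tagged_line)]
    # Drop the tags but keep the tagged text to recover the transcription.
    plain_text = TAG_PATTERN.sub(lambda m: m.group(2), tagged_line)
    return plain_text, entities


# Made-up example line, not taken from the corpus.
line = "<PER>Johannes</PER> de <LOC>Praga</LOC> anno <DAT>1350</DAT>"
plain, entities = split_tagged_line(line)
print(plain)
print(entities)
```

In a sequential pipeline, by contrast, the recognizer would output only the plain transcription, and a separate NER model would then label its tokens, so any transcription error is passed on to the entity tagger.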
