Named Entity Recognition in Semi Structured Documents Using Neural Tensor Networks

Information extraction and named entity recognition algorithms underpin many practical document analysis systems. Semi-structured documents pose several challenges for extracting relevant information. State-of-the-art methods rely heavily on feature engineering to perform layout-specific extraction and therefore do not generalize well. A first step toward a general solution is to extract information without taking the document layout into consideration. To address this challenge, we propose a deep-learning-based pipeline to extract information from documents. For this purpose, we define ‘information’ to be a set of entities that each have a label and a corresponding value, e.g., application_number: ADNF8932NF and submission_date: 15FEB19. We form relational triplets by connecting one entity to another via a relationship, such as (max_temperature, is, 100 degrees), and train a neural tensor network, which is well suited to this kind of data, to predict high confidence scores for true triplets. A test accuracy of up to 96% on real-world documents from the publicly available GHEGA dataset demonstrates the effectiveness of our approach.
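
To make the triplet-scoring idea concrete, the following is a minimal sketch of how a neural tensor layer can assign a score to a triplet such as (max_temperature, is, 100 degrees). It follows the scoring function commonly used for neural tensor networks in the knowledge base completion literature: a bilinear tensor term plus a linear term, passed through tanh and weighted by an output vector. The abstract does not state the exact configuration used here, so the function names (`ntn_score`, `init_params`, `margin_loss`), the embedding size, the number of tensor slices, the toy embeddings, and the max-margin loss are all assumptions made for illustration.

```python
import math
import random


def ntn_score(e1, e2, W, V, b, u):
    """Score one (e1, relation, e2) triplet with a neural tensor layer.

    e1, e2 : entity vectors (lists of floats) of length d
    W      : k slices, each a d x d matrix (the bilinear tensor)
    V      : k x 2d matrix (the linear layer over [e1; e2])
    b      : bias vector of length k
    u      : output weight vector of length k
    All parameters belong to a single relation.
    """
    concat = e1 + e2  # list concatenation, i.e. [e1; e2]
    score = 0.0
    for s in range(len(W)):
        # Bilinear term e1^T W[s] e2 for tensor slice s.
        bilinear = sum(
            e1[i] * W[s][i][j] * e2[j]
            for i in range(len(e1))
            for j in range(len(e2))
        )
        # Linear term V[s] . [e1; e2].
        linear = sum(V[s][j] * concat[j] for j in range(len(concat)))
        score += u[s] * math.tanh(bilinear + linear + b[s])
    return score


def init_params(d, k, seed=0):
    """Small random parameters for one relation (hypothetical initialisation)."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    W = [[[rand() for _ in range(d)] for _ in range(d)] for _ in range(k)]
    V = [[rand() for _ in range(2 * d)] for _ in range(k)]
    b = [0.0] * k
    u = [rand() for _ in range(k)]
    return W, V, b, u


def margin_loss(true_score, corrupt_score, margin=1.0):
    """Max-margin loss (an assumed training objective): it is zero once the
    true triplet outscores the corrupted one by at least `margin`."""
    return max(0.0, margin - true_score + corrupt_score)


if __name__ == "__main__":
    d, k = 4, 2  # assumed embedding size and number of tensor slices
    W, V, b, u = init_params(d, k)

    # Toy embeddings standing in for learned entity representations.
    max_temperature = [0.2, -0.1, 0.4, 0.0]
    hundred_degrees = [0.3, 0.1, -0.2, 0.5]
    unrelated_value = [-0.4, 0.2, 0.1, -0.3]

    s_true = ntn_score(max_temperature, hundred_degrees, W, V, b, u)
    s_corrupt = ntn_score(max_temperature, unrelated_value, W, V, b, u)
    print("true triplet score:     ", s_true)
    print("corrupted triplet score:", s_corrupt)
    print("margin loss:            ", margin_loss(s_true, s_corrupt))
```

In a sketch like this, training would lower the margin loss over many true and corrupted triplets, so that true label-value pairs end up with higher scores than mismatched ones. The parameters here are untrained, so the printed scores carry no meaning beyond showing the data flow.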
