Few-exemplar Information Extraction for Business Documents

The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts and administrators. Small office/home office (SOHO) users and private individuals often do not benefit from such systems. Low extraction efficiency, especially in the starting period when only a few example documents are available, combined with the high effort needed to annotate new documents, drastically lowers their willingness to use a self-learning information extraction system. We therefore present a solution for information extraction that fits the requirements of these users. It adapts the idea of one-shot learning from computer vision to the domain of business document processing and requires only a minimal number of training examples to reach competitive extraction efficiency. Our evaluation on a set of 12,500 documents following 399 different layouts/templates shows an F1 score of 88% on 10 commonly used fields such as document type, sender, recipient, and date. With only one document of each template in the training set, we already reach an F1 score of 78%.
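The abstract reports F1 scores but does not say how they are computed. As a minimal sketch, assuming exact-match scoring per field and micro-averaging over all documents (both are assumptions, not details confirmed by the text), an F1 score for field extraction could be calculated as follows. The function name `f1_score`, the field names, and the sample values are hypothetical.

```python
from typing import Dict, List

# One document's extraction result: field name -> extracted value.
Fields = Dict[str, str]


def f1_score(predicted: List[Fields], gold: List[Fields]) -> float:
    """Micro-averaged F1 over all (document, field) pairs.

    Assumption: a field counts as correct only on an exact string match.
    A wrong value counts as both a false positive and a false negative.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        # Every predicted field is either correct (TP) or spurious/wrong (FP).
        for field, value in pred.items():
            if ref.get(field) == value:
                tp += 1
            else:
                fp += 1
        # Every gold field that was missed or extracted wrongly is an FN.
        for field, value in ref.items():
            if pred.get(field) != value:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two documents, one missed "date" field.
gold = [
    {"sender": "ACME GmbH", "date": "2013-05-02"},
    {"sender": "Foo AG", "date": "2013-06-11"},
]
predicted = [
    {"sender": "ACME GmbH", "date": "2013-05-02"},
    {"sender": "Foo AG"},
]
score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this example precision is 3/3 and recall is 3/4. Other conventions, such as macro-averaging per field or partial-match credit, would give different numbers for the same extractions.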
