Paper information - Few-exemplar Information Extraction for Business Documents

The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts and administrators. Small office/home office (SOHO) users and private individuals often do not benefit from such systems. Low extraction efficiency, especially in the starting period when only a small number of example documents is available, combined with the high effort needed to annotate new documents, drastically lowers their willingness to use a self-learning information extraction system. Therefore, we present a solution for information extraction that fits the requirements of these users. It adapts the idea of one-shot learning from computer vision to the domain of business document processing and requires only a minimal number of training examples to reach competitive extraction efficiency. Our evaluation on a set of 12,500 documents following 399 different layouts/templates shows extraction results of 88% F1 score on 10 commonly used fields such as document type, sender, recipient, and date. With only one document of each template in the training set, we already reach an F1 score of 78%.
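The abstract reports results as F1 scores, which are the harmonic mean of precision and recall. As a minimal sketch of what such a score means for field extraction, the Python snippet below scores one document under exact-match comparison. The function name `field_f1`, the field names, and the sample values are hypothetical, and exact-match scoring is an assumption; the paper's actual evaluation protocol may differ.

```python
from typing import Dict, Tuple


def field_f1(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one document.

    Assumption: a predicted field counts as correct only if its value
    matches the gold value exactly.
    """
    correct = sum(1 for field, value in predicted.items() if gold.get(field) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotation and system output for a single document.
gold = {
    "document_type": "invoice",
    "sender": "ACME GmbH",
    "recipient": "Jane Doe",
    "date": "2013-05-02",
}
predicted = {
    "document_type": "invoice",
    "sender": "ACME GmbH",
    "date": "2013-05-03",  # wrong value, counts against precision
}

precision, recall, f1 = field_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predicted fields are correct (precision 2/3) and two of four gold fields are recovered (recall 1/2), so F1 is 4/7. A corpus-level score would aggregate the counts over all documents before computing the ratios.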

Alexander Schill | Daniel Schuster | Daniel Esser | Klemens Muthmann
