The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in the application domain of document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work well, they still require considerable on-site configuration effort by domain experts and administrators. Small office/home office (SOHO) users and private individuals often do not benefit from such systems. Low extraction efficiency, especially in the starting period when only a small number of example documents is available, combined with the high effort of annotating new documents, drastically lowers their willingness to use a self-learning information extraction system. Therefore, we present a solution for information extraction that fits the requirements of these users. It adapts the idea of one-shot learning from computer vision to the domain of business document processing and requires only a minimal number of training examples to reach competitive extraction efficiency. Our evaluation on a set of 12,500 documents following 399 different layouts/templates shows extraction results of 88% F1 score on 10 commonly used fields such as document type, sender, recipient, and date. We already reach an F1 score of 78% with only one document of each template in the training set.