One-Shot Template Matching for Automatic Document Data Capture

In this paper, we propose a novel one-shot template-matching algorithm that automatically captures data from business documents, with the aim of minimizing manual data entry. Given a single annotated document, the algorithm extracts the corresponding data from other documents that share the same format. Because it relies on a set of engineered visual and textual features, the method is invariant to changes in position and value. In experiments on a dataset of 595 real invoices, it achieves 86.4% accuracy.
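To make the one-shot idea concrete, the following is a minimal illustrative sketch, not the algorithm proposed here. It assumes each document is already available as OCR tokens with coordinates, that a field is annotated as a label token plus a value token, and that a simple string similarity stands in for the engineered textual features. The names `Token`, `similarity`, and `extract_field` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import hypot


@dataclass(frozen=True)
class Token:
    """One OCR token with the coordinates of its top-left corner (assumed input format)."""
    text: str
    x: float
    y: float


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; a stand-in for textual features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_field(anchor: Token, value: Token, target_tokens: list[Token]) -> str:
    """Locate the annotated field in a new document of the same format.

    anchor and value come from the single annotated document; only the
    anchor text and the anchor-to-value offset are reused, so the result
    does not depend on where the field sits or what its value is.
    """
    # Offset from the label to its value in the annotated document.
    dx, dy = value.x - anchor.x, value.y - anchor.y
    # Find the token in the new document that best matches the label text.
    matched = max(target_tokens, key=lambda t: similarity(t.text, anchor.text))
    # Predict where the value should be and take the nearest other token.
    ex, ey = matched.x + dx, matched.y + dy
    candidates = [t for t in target_tokens if t is not matched]
    nearest = min(candidates, key=lambda t: hypot(t.x - ex, t.y - ey))
    return nearest.text


# Annotated example: the label "Invoice No:" with its value 100 units to the right.
template_anchor = Token("Invoice No:", 100, 50)
template_value = Token("INV-001", 200, 50)

# New document in the same format: the field has moved and holds a different value.
new_document = [
    Token("Date:", 80, 120), Token("2019-03-01", 180, 120),
    Token("Invoice No.", 80, 90), Token("INV-774", 180, 90),
    Token("Total", 80, 300), Token("412.50", 180, 300),
]

print(extract_field(template_anchor, template_value, new_document))  # INV-774
```

In this toy setup the label is matched despite a small OCR-style difference ("No." versus "No:"), and the value is found through its position relative to the label rather than its absolute position or content.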
