Automatic Static/Variable Content Separation in Administrative Document Images

In this paper we present an automatic method for separating static and variable content in administrative document images. An alignment approach builds probabilistic templates, without supervision, from a set of examples of the same document type. Each template gives, for every pixel, the likelihood that it belongs to static or to variable content. In the extraction step, the same alignment technique matches an incoming image against the template and locates the positions where variable fields appear. We validate our approach on the public NIST Structured Tax Forms Dataset.
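As a rough illustration of the per-pixel template idea, the sketch below estimates how often ink appears at each pixel across a set of examples and thresholds that frequency. It is a simplified stand-in rather than the method of the paper: it assumes the images are already aligned and binarized (1 = ink), and the function name `build_template` and both thresholds are hypothetical choices made for this example.

```python
import numpy as np

def build_template(aligned, static_thresh=0.9, background_thresh=0.1):
    """Estimate a per-pixel template from pre-aligned binary images.

    aligned: sequence of equally sized 2-D arrays with 1 = ink, 0 = background.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    stack = np.stack([np.asarray(a, dtype=float) for a in aligned])
    # Fraction of examples that have ink at each pixel.
    ink_freq = stack.mean(axis=0)
    # Ink present in almost every example: treated as static content.
    static = ink_freq >= static_thresh
    # Ink present in some examples only: treated as variable content.
    variable = (ink_freq > background_thresh) & ~static
    return ink_freq, static, variable

# Three toy 2x3 "documents": the top-left pixel is always inked,
# while two other pixels are inked only in some examples.
examples = [
    [[1, 0, 1], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 1]],
    [[1, 0, 1], [0, 0, 0]],
]
ink_freq, static, variable = build_template(examples)
```

In this toy case only the top-left pixel is marked static, and the two pixels that are inked in some examples but not all are marked variable. A real pipeline would first register each image to a common reference, which this sketch leaves out.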
