Multipage Administrative Document Stream Segmentation

This paper proposes a framework for the segmentation and classification of document streams. The framework is composed of two modules, segmentation and verification, both of which use an incremental classifier that learns progressively along the stream. In the segmentation module, the relationship between two consecutive pages is classified as either continuity or rupture. A rupture denotes a clear break, and thus probably delimits a complete document. If the classifier is uncertain whether the relationship is a continuity or a rupture, an over-segmentation is proposed and the result is treated as a fragment, i.e. a portion of a document. Both fragments and documents are sent to the verification module, which includes a correction module in addition to the incremental classifier. The classifier predicts the classes of fragments and documents. The predicted class represents a context, which is used as a query to search for similar contexts in the correction module and to correct the segmentation and verification results. Corrections are sent back to the segmentation and verification modules so that they learn the correct classes. Results on real-world databases show the effectiveness and stability of the approach.
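The segmentation decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the classifier exposes a rupture probability for each pair of consecutive pages, and it models uncertainty with two hypothetical thresholds, `low` and `high`, which do not appear in the paper. A probability between the two thresholds triggers an over-segmentation, and any segment touching such an uncertain cut is labelled a fragment.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (first_page, last_page, kind)


def segment_stream(
    rupture_probs: Sequence[float],
    low: float = 0.3,   # assumed threshold: at or below this, continuity
    high: float = 0.7,  # assumed threshold: at or above this, clear rupture
) -> List[Segment]:
    """Split a page stream into documents and fragments.

    rupture_probs[i] is the (assumed) classifier probability that a
    rupture separates page i from page i + 1.
    """
    n_pages = len(rupture_probs) + 1
    segments: List[Segment] = []
    start = 0
    start_uncertain = False  # was the cut that opened this segment uncertain?

    for i, p in enumerate(rupture_probs):
        if p <= low:
            continue  # continuity: page i + 1 stays in the current segment
        end_uncertain = p < high  # between thresholds: over-segment here
        kind = "fragment" if (start_uncertain or end_uncertain) else "document"
        segments.append((start, i, kind))
        start = i + 1
        start_uncertain = end_uncertain

    # The end of the stream is treated as a clear boundary.
    kind = "fragment" if start_uncertain else "document"
    segments.append((start, n_pages - 1, kind))
    return segments


if __name__ == "__main__":
    for seg in segment_stream([0.1, 0.9, 0.5, 0.2, 0.95]):
        print(seg)
```

In the framework described above, both kinds of segment would then go to the verification module, where fragments can be corrected and merged. The sketch covers only the cut-or-continue decision.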
