Document image and zone classification through incremental learning

We present an incremental learning method for document image and zone classification. We consider an industrial context in which the system faces a highly variable stream of digitized administrative documents that become available progressively over time. Each incoming document is segmented into physical regions (zones), which are classified according to a zone-model. The document is then represented by its classified zones and classified according to a document-model. Classification relies on a reject utility that rejects ambiguous zones or documents. Both models are updated incrementally by learning each new document and its extracted zones. We validate the method on real administrative document images and achieve a recognition rate of more than 92%.
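The pipeline described above (classify zones, represent the document by its classified zones, classify the document, reject ambiguous cases, update the models incrementally) can be illustrated with a minimal sketch. The abstract does not specify the classifier, the features, or the reject rule, so everything below is an assumption made for illustration: a nearest-class-mean classifier stands in for both models, rejection uses a relative distance margin between the two closest classes, and the names `IncrementalPrototypeClassifier`, `reject_margin`, and `document_features` are hypothetical.

```python
import math


class IncrementalPrototypeClassifier:
    """Nearest-class-mean classifier with a reject option.

    Illustrative stand-in only; not the model used in the paper.
    """

    def __init__(self, reject_margin=0.2):
        # Assumed reject rule: reject when the two closest classes differ
        # in distance by less than this fraction of the larger distance.
        self.reject_margin = reject_margin
        self.sums = {}    # label -> running sum of feature vectors
        self.counts = {}  # label -> number of samples seen

    def learn(self, features, label):
        """Incrementally update the class mean with one labeled sample."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def classify(self, features):
        """Return the predicted label, or None when the sample is rejected."""
        if not self.sums:
            return None  # nothing learned yet: reject
        dists = []
        for label, total in self.sums.items():
            mean = [v / self.counts[label] for v in total]
            dists.append((math.dist(features, mean), label))
        dists.sort()
        if len(dists) == 1:
            return dists[0][1]
        (d_best, best), (d_second, _) = dists[0], dists[1]
        if d_second - d_best < self.reject_margin * d_second:
            return None  # ambiguous between the two closest classes
        return best


def document_features(zone_labels, zone_classes):
    """Represent a document as the fraction of its zones in each class.

    Rejected zones (None) count toward the total but toward no class.
    """
    total = len(zone_labels)
    if total == 0:
        return [0.0] * len(zone_classes)
    return [zone_labels.count(c) / total for c in zone_classes]


# Toy usage with made-up 2-D zone features and hypothetical class names.
zone_model = IncrementalPrototypeClassifier()
zone_model.learn([1.0, 0.0], "text")
zone_model.learn([0.0, 1.0], "table")

clear_zone = zone_model.classify([0.9, 0.1])      # close to the "text" mean
ambiguous_zone = zone_model.classify([0.5, 0.5])  # equidistant, so rejected

doc_vector = document_features(["text", "text", "table", None], ["text", "table"])
```

In this sketch a rejected zone or document would be handed to an operator for labeling and then passed to `learn`, which is one plausible way for the models to absorb new document types over time.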
