Page classification through logical labelling

We propose an integrated approach to page classification and logical labelling. The layout of each document class is represented by a fully connected attributed relational graph, which is matched to the graph of an unknown document so that classification and labelling are achieved simultaneously. Incorporating global constraints in this integrated fashion reduces ambiguity at the zone level and provides robustness to noise and variation. Models are trained automatically from sample documents. Experimental results show promise for the classification and labelling of technical article title pages, and support the idea of a hierarchical model base.
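
The idea of matching a model graph to a document graph can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the matching algorithm, the attributes, or the cost function, so the sketch substitutes an exhaustive search over zone assignments. The names `MODELS`, `node_cost`, `edge_cost`, `match`, and `classify_and_label`, along with all attribute values, are hypothetical and chosen only for illustration. Node attributes are assumed to be zone width and height, and edge attributes the relative offset between zone centres.

```python
from itertools import permutations

# Each zone is (x_center, y_center, width, height), normalised to [0, 1].
# Model zones carry logical labels. All values are invented for illustration.
MODELS = {
    "article_title_page": {
        "title":    (0.50, 0.10, 0.80, 0.06),
        "author":   (0.50, 0.20, 0.50, 0.03),
        "abstract": (0.50, 0.40, 0.70, 0.25),
    },
    "letter": {
        "sender":   (0.25, 0.10, 0.30, 0.08),
        "date":     (0.80, 0.20, 0.20, 0.02),
        "body":     (0.50, 0.60, 0.80, 0.50),
    },
}

def node_cost(a, b):
    # Unary (node) attributes: difference in zone width and height.
    return abs(a[2] - b[2]) + abs(a[3] - b[3])

def edge_cost(a1, a2, b1, b2):
    # Binary (edge) attributes: difference in relative offset between centres.
    dx = (a2[0] - a1[0]) - (b2[0] - b1[0])
    dy = (a2[1] - a1[1]) - (b2[1] - b1[1])
    return abs(dx) + abs(dy)

def match(model, zones):
    # Exhaustive search over injective assignments of model nodes to zones.
    # Assumed stand-in for a real graph matcher; only feasible for tiny graphs.
    labels = list(model)
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(zones)), len(labels)):
        cost = sum(node_cost(model[l], zones[i]) for l, i in zip(labels, perm))
        # Fully connected graph: every pair of model nodes contributes an edge.
        for p in range(len(labels)):
            for q in range(p + 1, len(labels)):
                cost += edge_cost(model[labels[p]], model[labels[q]],
                                  zones[perm[p]], zones[perm[q]])
        if cost < best_cost:
            best_cost = cost
            best_assign = {perm[k]: labels[k] for k in range(len(labels))}
    return best_cost, best_assign

def classify_and_label(zones, models=MODELS):
    # The best-matching model gives the page class; its assignment gives
    # the logical label of each zone, so both results come from one match.
    results = {name: match(m, zones) for name, m in models.items()}
    best = min(results, key=lambda name: results[name][0])
    return best, results[best][1]

if __name__ == "__main__":
    unknown = [(0.50, 0.41, 0.70, 0.24),
               (0.50, 0.11, 0.78, 0.06),
               (0.50, 0.21, 0.50, 0.03)]
    print(classify_and_label(unknown))
```

Because the edge terms compare pairs of zones, a zone whose own attributes are ambiguous can still be labelled correctly from its position relative to the others, which is the role global constraints play in the abstract. The exhaustive search grows factorially with the number of zones, so a practical system would need an approximate matcher.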
