Fine-grained document genre classification using first order random graphs

We address the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, because document content features are similar across such genres. We determine document genre from the layout structure detected in scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structure. Our method uses attributed relational graphs (ARGs) to represent the layout structure of document instances and first-order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation of our technique against a variety of statistical pattern classifiers. FORGs can model the common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
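To make the ARG representation concrete, the following is a minimal Python sketch of how the layout of one page might be stored as an attributed relational graph. It is an illustration under stated assumptions, not the representation used in the paper: the names `Block`, `LayoutARG`, `add_block`, and `relate`, the bounding-box vertex attributes, and the spatial edge labels are all hypothetical choices made here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """A layout block found on a binary page image.

    Assumption: each vertex carries only a bounding box in pixels;
    the abstract does not specify which attributes are used.
    """
    x: int
    y: int
    width: int
    height: int


@dataclass
class LayoutARG:
    """Attributed relational graph: blocks are vertices, spatial relations are edges."""
    vertices: list[Block] = field(default_factory=list)
    edges: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_block(self, block: Block) -> int:
        """Add a vertex and return its index."""
        self.vertices.append(block)
        return len(self.vertices) - 1

    def relate(self, i: int, j: int) -> str:
        """Label the directed edge (i, j) with a coarse spatial relation (hypothetical label set)."""
        a, b = self.vertices[i], self.vertices[j]
        if a.y + a.height <= b.y:
            label = "above"
        elif b.y + b.height <= a.y:
            label = "below"
        elif a.x + a.width <= b.x:
            label = "left_of"
        elif b.x + b.width <= a.x:
            label = "right_of"
        else:
            label = "overlaps"
        self.edges[(i, j)] = label
        return label


# Usage: a title block sitting above a body-text block.
page = LayoutARG()
title = page.add_block(Block(x=100, y=50, width=400, height=40))
body = page.add_block(Block(x=100, y=120, width=400, height=600))
relation = page.relate(title, body)
```

In this sketch a genre model would then summarize many such graphs, for example by estimating how often each vertex and edge attribute value occurs across the training documents of a genre; how the paper actually builds its FORGs is not stated in the abstract.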
