Distance measures for layout-based document image retrieval

Most methods for document image retrieval rely solely on text information to find similar documents. This paper describes a way to use layout information for document image retrieval instead. A new class of distance measures is introduced for documents with Manhattan layouts, based on a two-step procedure: First, the distances between the blocks of two layouts are calculated. Then, the blocks of one layout are assigned to the blocks of the other layout in a matching step. Different block distances and matching methods are compared and evaluated using the publicly available MARG database. On this dataset, the layout type can be determined successfully in 92.6% of the cases using the best distance measure in a nearest neighbor classifier. The experiments show that the best distance measure for this task is the overlapping area combined with the Manhattan distance of the corner points as block distance together with the minimum weight edge cover matching

[1]  Thomas M. Breuel,et al.  Performance Comparison of Six Algorithms for Page Segmentation , 2006, Document Analysis Systems.

[2]  Mahesh Viswanathan,et al.  A prototype document image analysis system for technical journals , 1992, Computer.

[3]  Song Mao,et al.  Empirical Performance Evaluation Methodology and Its Application to Page Segmentation Algorithms , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Motoi Iwata,et al.  Segmentation of Page Images Using the Area Voronoi Diagram , 1998, Comput. Vis. Image Underst..

[5]  Luc Vincent,et al.  Pink Panther: A Complete Environment For Ground-Truthing And Benchmarking Document Page Segmentation , 1998, Pattern Recognit..

[6]  Horst Bunke,et al.  Document Image Analysis , 1994, Series in Machine Perception and Artificial Intelligence.

[7]  Donald E. Knuth,et al.  The Stanford GraphBase - a platform for combinatorial computing , 1993 .

[8]  Thomas M. Breuel,et al.  Two Geometric Algorithms for Layout Analysis , 2002, Document Analysis Systems.

[9]  Jianying Hu,et al.  Document classification using layout analysis , 1999, Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99.

[10]  Henry S. Baird Background Structure in Document Images , 1994, Int. J. Pattern Recognit. Artif. Intell..

[11]  C. Tomasi The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval , 1997 .

[12]  Hermann Ney,et al.  Pixel-to-Pixel Matching for Image Recognition Using Hungarian Graph Matching , 2004, DAGM-Symposium.

[13]  Robert M. Haralick,et al.  Performance Evaluation of Document Structure Extraction Algorithms , 2001, Comput. Vis. Image Underst..

[14]  Hermann Ney,et al.  Classification error rate for quantitative evaluation of content-based image retrieval systems , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[15]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Friedrich M. Wahl,et al.  Document Analysis System , 1982, IBM J. Res. Dev..