Similarity measure for CCITT Group 4 compressed document images

Measuring the similarity of document images plays a crucial role in document image retrieval. A method for measuring the similarity of CCITT Group 4 compressed document images is proposed. Features are extracted directly from the changing elements of the compressed images, without full decompression. An unsupervised classifier based on the weighted Hausdorff distance assigns all word objects from two document images to corresponding classes, while likely stop words are excluded. Document vectors are built from the occurrence frequencies of the word object classes, and the pairwise similarity of two document images is given by the scalar product of their document vectors. Five groups of articles from different domains are used to test the validity of the presented approach.
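
The final step described above, building document vectors from word object class frequencies and comparing them with a scalar product, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the classification step has already produced an integer class label for each word object, and the names `document_vector`, `similarity`, and `stop_classes` are hypothetical. The unit-length normalization is also an assumption, added so that scores fall in [0, 1]; the abstract states only that a scalar product is used.

```python
from collections import Counter
import math


def document_vector(class_labels, num_classes, stop_classes=frozenset()):
    """Build a frequency vector over word object classes.

    class_labels: one integer class label per word object (assumed to come
                  from the unsupervised classification step).
    stop_classes: labels treated as stop words and excluded (hypothetical).
    """
    counts = Counter(c for c in class_labels if c not in stop_classes)
    vec = [float(counts.get(k, 0)) for k in range(num_classes)]
    # Assumption: normalize to unit length so the scalar product is bounded.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def similarity(u, v):
    """Scalar (dot) product of two document vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two toy documents sharing a label space of 4 classes; class 3 is a stop class.
doc_a = document_vector([0, 0, 1, 2, 3], num_classes=4, stop_classes={3})
doc_b = document_vector([0, 1, 1, 2, 3], num_classes=4, stop_classes={3})
score = similarity(doc_a, doc_b)
```

With normalized vectors, a document compared with itself scores 1, and documents with no word object classes in common score 0.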
