A novel document similarity measure based on earth mover's distance

In this paper we propose a novel measure based on the earth mover's distance (EMD) that evaluates document similarity by allowing many-to-many matching between subtopics. Each document is first decomposed into a set of subtopics, and the EMD is then used to evaluate the similarity between the two sets of subtopics of two documents by solving the transportation problem. The proposed measure improves on the previous optimal matching (OM) based measure, which allows only one-to-one matching between subtopics. Experiments on the TDT3 dataset compare the proposed measure with existing similarity measures, and the results show that the EMD-based measure outperforms the OM-based measure and all other measures evaluated. In addition to the TextTiling algorithm, a sentence clustering algorithm is adopted for document decomposition, and the experimental results show that the proposed EMD-based measure does not depend on the choice of document decomposition algorithm and is thus more robust than the OM-based measure.
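To make the many-to-many matching concrete, the following is a minimal sketch of the EMD between two weighted sets of subtopics, computed as a transportation problem. It is an illustration under stated assumptions, not the implementation used in the paper: the function names `cosine` and `emd`, the use of 1 minus cosine similarity as the ground distance, the subtopic weights, and the successive-shortest-path solver (used here in place of a linear programming solver to keep the example dependency-free) are all hypothetical choices.

```python
import math

EPS = 1e-12  # tolerance for floating-point comparisons


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts (assumed representation)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def emd(weights_a, weights_b, dist):
    """EMD between two weighted subtopic sets.

    dist[i][j] is the ground distance between subtopic i of document A and
    subtopic j of document B (assumption: e.g. 1 - cosine similarity).
    The transportation problem is solved as a min-cost flow.
    """
    m, n = len(weights_a), len(weights_b)
    src, snk, size = m + n, m + n + 1, m + n + 2
    # Each edge is [target, residual capacity, cost]; edge e ^ 1 reverses edge e.
    edges, adj = [], [[] for _ in range(size)]

    def add_edge(u, v, cap, cost):
        adj[u].append(len(edges))
        edges.append([v, cap, cost])
        adj[v].append(len(edges))
        edges.append([u, 0.0, -cost])

    for i, w in enumerate(weights_a):
        add_edge(src, i, w, 0.0)
    for j, w in enumerate(weights_b):
        add_edge(m + j, snk, w, 0.0)
    for i in range(m):
        for j in range(n):
            add_edge(i, m + j, math.inf, dist[i][j])

    # Total flow to move is the smaller of the two total weights.
    remaining = min(sum(weights_a), sum(weights_b))
    total_flow = total_cost = 0.0
    while remaining > EPS:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [math.inf] * size
        via = [-1] * size
        best[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == math.inf:
                    continue
                for e in adj[u]:
                    v, cap, cost = edges[e]
                    if cap > EPS and best[u] + cost < best[v] - EPS:
                        best[v], via[v], changed = best[u] + cost, e, True
            if not changed:
                break
        if via[snk] == -1:
            break  # no augmenting path left
        # Find the bottleneck along the path, then push flow along it.
        push, v = remaining, snk
        while v != src:
            push = min(push, edges[via[v]][1])
            v = edges[via[v] ^ 1][0]
        v = snk
        while v != src:
            e = via[v]
            edges[e][1] -= push
            edges[e ^ 1][1] += push
            v = edges[e ^ 1][0]
        total_flow += push
        total_cost += push * best[snk]
        remaining -= push
    # Normalize the transport cost by the total flow.
    return total_cost / total_flow if total_flow > 0 else 0.0


# Document A has two equally weighted subtopics, document B has one.
# Both subtopics of A flow to the single subtopic of B (many-to-one).
print(emd([0.5, 0.5], [1.0], [[0.0], [1.0]]))  # 0.5
```

In this toy case a one-to-one matching could pair only one subtopic of document A with the single subtopic of document B, whereas the flow-based formulation lets both contribute in proportion to their weights. A similarity score could then be taken as 1 minus the EMD when distances lie in [0, 1], which is again an assumption of this sketch.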
