Paper Information - Comparing Different Text Similarity Methods

Comparing Different Text Similarity Methods

This paper reports experiments comparing different text similarity models on a corpus of news articles from the Financial Times. First, the Ferret system, which uses a method based solely on lexical similarity, is applied; then methods based on semantic similarity are investigated. Different criteria are used to select the feature strings that are compared, for instance with and without synonyms obtained from WordNet, or with extracted noun phrases. The results indicate that synonyms, rather than lexical strings, are important for finding similar texts. Hypernyms and noun phrases also contribute to identifying text similarity, though they perform no better than synonyms. However, precision is a problem for the semantic similarity methods, because they retrieve too many irrelevant texts.
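The abstract does not spell out how a similarity score is computed, so the following is only a minimal sketch of the contrast it draws between lexical and synonym-based matching. It assumes word trigrams as the feature strings and Jaccard overlap as the score; both are assumptions made for illustration, not details taken from the paper. The `TOY_SYNONYMS` table is a hypothetical hand-written stand-in for WordNet lookups, and every function name is invented for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical synonym table standing in for WordNet (assumption, not from the paper).
# Each word maps to one canonical representative of its synonym group.
TOY_SYNONYMS: Dict[str, str] = {"firm": "company", "profits": "earnings"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [word.strip(".,;:!?\"'()") for word in text.lower().split()]
    return [token for token in tokens if token]


def ngrams(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams (trigrams by default)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(text_a: str, text_b: str,
               synonyms: Optional[Dict[str, str]] = None, n: int = 3) -> float:
    """Trigram overlap of two texts, optionally after mapping words to synonyms."""
    mapping = synonyms or {}
    tokens_a = [mapping.get(t, t) for t in tokenize(text_a)]
    tokens_b = [mapping.get(t, t) for t in tokenize(text_b)]
    return jaccard(ngrams(tokens_a, n), ngrams(tokens_b, n))


text_a = "The company reported higher earnings this year."
text_b = "The firm reported higher profits this year."

lexical_score = similarity(text_a, text_b)                 # exact word strings only
synonym_score = similarity(text_a, text_b, TOY_SYNONYMS)   # synonyms collapsed first
print(lexical_score, synonym_score)
```

In this toy pair every trigram contains a substituted word, so the purely lexical score is zero while the synonym-normalised score is one. The same normalisation also makes loosely related texts look more alike, which is one plausible reading of the precision problem the abstract mentions.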
