Paper Information - Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents

Keyword-based search engines usually return a large number of results, many of which are unrelated documents or documents with the same content, so automatic clustering technology is used to classify the retrieval results. When the number of Web retrieval results is large, however, the clustering process usually takes a long time, and the clusters are not user-friendly because they still contain many documents with the same content. This paper proposes an improved clustering method that removes duplicate documents from the retrieval results. The removal operation is first executed in the initial partition stage of clustering, and is then executed again after the initial partition stage to remove the duplicate documents thoroughly. We propose an efficient removal method for this second pass. Finally, we verify our method experimentally.

Xinye Li | Qinhai Yang | LinNa Zeng

[1] Udi Manber, et al. Finding Similar Files in a Large File System, 1994, USENIX Winter.

[2] Sugato Basu. Semi-supervised Clustering: Learning with Limited User Feedback, 2004.

[3] Pang Jian, et al. Research and Implementation of Text Categorization System Based on VSM, 2001.

[4] Li Sheng. Search Result Clustering Based on Centroid Optimization by Ontology Extraction, 2008.

[5] Hector Garcia-Molina, et al. SCAM: A Copy Detection Mechanism for Digital Documents, 1995, DL.

[6] Liu Hai-feng. Research on the VSM Text Retrieval Based on Clustering, 2006.

[7] Hector Garcia-Molina, et al. Copy Detection Mechanisms for Digital Documents, 1995, SIGMOD '95.

[8] Shen Qiang. A Hierarchical Search Results Clustering Method Based on K-Means, 2010.

[9] Douglas M. Campbell, et al. Copy Detection Systems for Digital Documents, 2000, Proceedings IEEE Advances in Digital Libraries 2000.

[10] Xiao-Dong Liu, et al. A Fast Document Copy Detection Model, 2006, Soft Comput.