The Similarity Computing of Documents Based on VSM

The precision and efficiency of document similarity computation underpin most other document-processing tasks. In this paper, the DF (document frequency) and TF-IDF algorithms are improved. First, DF-based feature selection has linear time complexity, which suits large-scale document processing, but it has the drawback that rare yet useful features may be deleted; we compensate for this by additionally counting the words that occur at important places in a document. Second, we adjust the weight of each feature using the result of the feature selection phase. In this way, we improve the precision of document similarity without adding much time or space complexity.

Qinglin Guo
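
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the sketch assumes that the "important places" are document titles, that a title occurrence adds a fixed bonus (`title_bonus`) to a term's DF score, and that a smoothed IDF is used. The function names, the threshold `min_score`, and the bonus value are all hypothetical, and the weight adjustment from the second step is not attempted.

```python
import math
from collections import Counter


def select_features(docs, min_score=2, title_bonus=1):
    """DF-style feature selection over (title_tokens, body_tokens) pairs.

    Assumption: titles stand in for the "important places"; a term that
    appears in a title gets an extra, hypothetical bonus on top of its DF.
    """
    score = Counter()
    for title, body in docs:
        for term in set(title) | set(body):
            score[term] += 1            # plain document frequency
        for term in set(title):
            score[term] += title_bonus  # assumed position bonus
    return {term for term, s in score.items() if s >= min_score}


def tfidf_vectors(docs, features):
    """Build sparse TF-IDF vectors (dicts) restricted to the selected features."""
    n = len(docs)
    tokens = [title + body for title, body in docs]
    df = Counter(t for doc in tokens for t in set(doc) if t in features)
    vectors = []
    for doc in tokens:
        tf = Counter(t for t in doc if t in features)
        # Smoothed IDF (an assumption) so that no selected term gets weight 0.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With a toy corpus such as `[(["vector", "space"], ["vector", "space", "model", "text"]), ...]`, a term like `space` that occurs in only one document but in its title survives selection under these assumptions, whereas a plain DF threshold of 2 would discard it.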
