A comparative study on unsupervised feature selection methods for text clustering

Text clustering is one of the central problems in text mining and information retrieval. Because of the high dimensionality of the feature space and the inherent sparsity of the data, the performance of clustering algorithms declines dramatically. Two techniques are used to deal with this problem: feature extraction and feature selection. Feature selection methods have been applied successfully to text categorization but seldom to text clustering, because class label information is unavailable. In this paper, four unsupervised feature selection methods are introduced: DF, TC, TVQ, and a newly proposed method, TV. Experiments show that feature selection can improve the efficiency as well as the accuracy of text clustering. Three clustering validity criteria are studied and used to evaluate the clustering results.
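
To make the idea of label-free feature selection concrete, the following is a minimal sketch, not the paper's implementation. It assumes that DF stands for document frequency and TV for term variance, and it assumes a simple unnormalized variance score (the sum over documents of the squared deviation of a term's frequency from its mean); the abstract gives neither definition. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """For each term, count the number of documents that contain it (assumed meaning of DF)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def term_variance(docs):
    """Assumed TV-style score: sum over documents of (tf - mean tf)^2 for each term."""
    n = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    vocab = set().union(*counts)
    scores = {}
    for term in vocab:
        mean = sum(c[term] for c in counts) / n
        scores[term] = sum((c[term] - mean) ** 2 for c in counts)
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus of tokenized documents; no class labels are needed.
docs = [
    ["cluster", "text", "text", "feature"],
    ["text", "feature", "selection"],
    ["cluster", "cluster", "cluster", "text"],
]

df = document_frequency(docs)
tv = term_variance(docs)
print(select_top_k(df, 2))
print(select_top_k(tv, 1))
```

Both scores are computed from term counts alone, which is what lets them be used for clustering, where class labels are unavailable. The reduced vocabulary would then be passed to the clustering algorithm in place of the full feature space.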