A Simple and Fast Term Selection Procedure for Text Clustering

Text clustering is receiving considerable attention in areas such as text mining and information retrieval. A starting point for clustering methods applied to unstructured document collections is the creation of a vector-space model, usually known as the bag-of-words model [1]. Documents are then described by a matrix that is typically huge and extremely sparse, owing to the very large number of terms needed to describe the document set. Although several techniques can be employed to reduce this number, the final figure is still high, leading to a feature space of high dimensionality. This paper presents a simple procedure that not only considerably reduces the dimensionality of the feature space, and hence the processing time, but also yields clustering performance comparable to, or even better than, that obtained with the full set of terms.
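To make the bag-of-words representation and its sparsity concrete, the following minimal Python sketch builds a small document-term count matrix and measures the fraction of zero entries. The final step, a document-frequency threshold, is only a generic illustration of term reduction; it is not the procedure proposed in this paper, which the abstract does not specify. All function names, the `min_df` parameter, and the toy documents are assumptions introduced for illustration.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the count of term j in document i."""
    # Naive whitespace tokenization; real pipelines usually add stop-word
    # removal and stemming before counting.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        rows.append(row)
    return vocabulary, rows


def sparsity(rows):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


def select_by_document_frequency(vocabulary, rows, min_df=2):
    """Keep only terms appearing in at least min_df documents.

    Illustrative baseline only (assumption); not the paper's method.
    """
    keep = [
        j for j in range(len(vocabulary))
        if sum(1 for row in rows if row[j] > 0) >= min_df
    ]
    reduced_vocabulary = [vocabulary[j] for j in keep]
    reduced_rows = [[row[j] for j in keep] for row in rows]
    return reduced_vocabulary, reduced_rows


# Hypothetical toy corpus.
docs = [
    "text clustering groups similar documents",
    "text mining extracts patterns from documents",
    "spectral clustering uses eigenvectors",
]
vocabulary, matrix = build_term_document_matrix(docs)
reduced_vocabulary, reduced_matrix = select_by_document_frequency(
    vocabulary, matrix, min_df=2
)
print(len(vocabulary), len(reduced_vocabulary))
print(round(sparsity(matrix), 3))
```

Even on three short documents, more than half of the matrix entries are zero, and the threshold cuts the vocabulary from 12 terms to 3. On realistic collections the vocabulary runs to many thousands of terms, which is the dimensionality problem the abstract describes.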

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[3]  Andreas Hotho,et al.  A Brief Survey of Text Mining , 2005, LDV Forum.

[4]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[5]  I. Jolliffe Principal Component Analysis , 2002 .

[6]  Charles Nicholas,et al.  Feature Selection and Document Clustering , 2004 .

[7]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[8]  Yair Weiss,et al.  Segmentation using eigenvectors: a unifying view , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[9]  Gene H. Golub,et al.  Matrix computations (3rd ed.) , 1996 .

[10]  Jianchu Kang,et al.  A comparative study on unsupervised feature selection methods for text clustering , 2005, 2005 International Conference on Natural Language Processing and Knowledge Engineering.

[11]  Amit Konar,et al.  Document Clustering Using Differential Evolution , 2006, 2006 IEEE International Conference on Evolutionary Computation.

[12]  Thomas Gärtner,et al.  A survey of kernels for structured data , 2003, SKDD.

[13]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[14]  Alexandros Karatzoglou,et al.  Text Clustering with String Kernels in R , 2006, GfKl.

[15]  Yiming Yang,et al.  Noise reduction in a statistical approach to text categorization , 1995, SIGIR '95.

[16]  Dan Klein,et al.  Spectral Learning , 2003, IJCAI.

[17]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[18]  L. Hunter,et al.  MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling , 1999, BioTechniques.

[19]  David G. Stork,et al.  Pattern Classification , 1973 .

[20]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[21]  Dawid Weiss,et al.  Carrot2: Design of a Flexible and Efficient Web Information Retrieval Framework , 2005, AWIC.

[22]  Rainer Storn,et al.  Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces , 1997, J. Glob. Optim.

[23]  Huan Liu,et al.  Feature Selection for Clustering , 2000, Encyclopedia of Database Systems.

[24]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[25]  Yasunori Yamamoto,et al.  Biomedical knowledge navigation by literature clustering , 2007, J. Biomed. Informatics.

[26]  W. John Wilbur,et al.  The automatic identification of stop words , 1992, J. Inf. Sci.

[27]  Luis Mateus Rocha,et al.  Singular value decomposition and principal component analysis , 2003 .

[28]  Wei-Ying Ma,et al.  An Evaluation on Feature Selection for Text Clustering , 2003, ICML.

[29]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[30]  Gene H. Golub,et al.  Matrix computations , 1983 .

[31]  Colin Campbell,et al.  An introduction to kernel methods , 2001 .

[32]  Tijl De Bie,et al.  Eigenproblems in Pattern Recognition , 2005 .

[33]  Fan Chung,et al.  Spectral Graph Theory , 1996 .

[34]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[35]  Oren Etzioni,et al.  Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[36]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[37]  Anton J. Enright,et al.  TEXTQUEST: Document Clustering of MEDLINE Abstracts For Concept Discovery In Molecular Biology , 2000, Pacific Symposium on Biocomputing.

[38]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[39]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[40]  Lin-Yu Tseng,et al.  A genetic approach to the automatic clustering problem , 2001, Pattern Recognit.