Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering

A major challenge in text mining arises from increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study of six Dimension Reduction Techniques (DRT) is conducted in the context of the text clustering problem, using three standard benchmark datasets. The methods considered include three feature transformation techniques, Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), and Random Projection (RP), and three feature selection techniques based on Document Frequency (DF), mean TfIdf (TI), and Term Frequency Variance (TfV). Experiments with the k-means clustering algorithm show that ICA and LSI are clearly superior to RP on all three datasets. Furthermore, it is shown that TI and TfV outperform DF for text clustering. Finally, experiments in which a selection technique is followed by a transformation technique show that this combination can substantially reduce the computational cost of the best transformation methods (ICA and LSI) while preserving clustering performance.
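To make the three selection criteria concrete, the sketch below scores each term of a tiny, made-up term-count matrix by DF, mean TfIdf, and TfV, then keeps the top-scoring terms. The formulas are assumptions, since the abstract does not state them: idf is taken as log(N / df), TI as the average of tf × idf over documents, and TfV as the sum of squared deviations of a term's counts from its mean. The names `selection_scores` and `top_terms` are hypothetical helpers, not part of the paper.

```python
import math

def selection_scores(counts):
    """Score every term under three selection criteria.

    counts: list of documents, each a list of raw term counts
            (all documents share the same vocabulary order).
    Assumed definitions (not given in the abstract):
      DF  = number of documents containing the term
      TI  = mean over documents of tf * log(N / DF)
      TfV = sum over documents of (tf - mean tf)^2
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    df, ti, tfv = [], [], []
    for t in range(n_terms):
        col = [doc[t] for doc in counts]
        d = sum(1 for c in col if c > 0)
        idf = math.log(n_docs / d) if d else 0.0
        mean_tf = sum(col) / n_docs
        df.append(d)
        ti.append(sum(c * idf for c in col) / n_docs)
        tfv.append(sum((c - mean_tf) ** 2 for c in col))
    return {"DF": df, "TI": ti, "TfV": tfv}

def top_terms(scores, k):
    """Return the indices of the k highest-scoring terms, in vocabulary order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

# Toy data: 3 documents, 4 terms. Term 0 occurs in every document,
# term 3 is concentrated in a single document.
counts = [
    [2, 0, 1, 0],
    [1, 0, 0, 3],
    [1, 1, 0, 0],
]
scores = selection_scores(counts)
best_by_df = top_terms(scores["DF"], 1)
best_by_ti = top_terms(scores["TI"], 1)
best_by_tfv = top_terms(scores["TfV"], 1)
```

On this toy matrix DF favors the term that appears everywhere, while TI and TfV favor the term concentrated in one document. In a combined pipeline of the kind the abstract describes, the retained columns would then be passed to a transformation such as LSI before clustering with k-means.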
