SISC: A Text Classification Approach Using Semi Supervised Subspace Clustering

Text classification poses some specific challenges. One is high dimensionality: the feature space is large, yet each document (data point) contains only a small subset of the dimensions. In this paper, we propose Semi-supervised Impurity based Subspace Clustering (SISC), used in conjunction with a k-Nearest Neighbor approach. SISC is a semi-supervised subspace clustering method that accounts for both the high dimensionality of text data and the sparsity of its dimensions. It finds clusters in subspaces of the high dimensional text data, where each text document has fuzzy cluster membership. This fuzzy clustering exploits two factors: the chi-square statistic of the dimensions and the impurity measure within each cluster. Empirical evaluation on real-world data sets shows the effectiveness of our approach, which significantly outperforms other state-of-the-art text classification and subspace clustering algorithms.
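The abstract names the two factors but does not give their formulas. As a rough illustration only, the sketch below computes the standard chi-square statistic for a term against a cluster from a 2x2 document-count table, and uses Shannon entropy of the labels in a cluster as a stand-in impurity measure. The function names, the argument layout, and the choice of entropy are assumptions made for this example and may differ from the definitions used by SISC.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Standard chi-square statistic for a 2x2 term/cluster table.

    Hypothetical argument layout (not taken from the paper):
      n11: documents in the cluster that contain the term
      n10: documents outside the cluster that contain the term
      n01: documents in the cluster that lack the term
      n00: documents outside the cluster that lack the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # An empty row or column makes the statistic undefined; return 0.0 then.
    return numerator / denominator if denominator else 0.0


def entropy_impurity(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster.

    Used here as an assumed stand-in for the impurity measure;
    a cluster with a single class scores 0.
    """
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A term that appears in every cluster document and nowhere else.
    print(chi_square(10, 0, 0, 10))
    # A cluster whose labeled documents are split evenly between two classes.
    print(entropy_impurity(["sports", "sports", "politics", "politics"]))
```

Under these assumptions, a dimension with a high chi-square value is strongly associated with the cluster, while a low impurity indicates that the labeled documents in the cluster mostly share one class.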
