Enhanced soft subspace clustering integrating within-cluster and between-cluster information

Most soft subspace clustering algorithms are built on within-cluster information alone; other useful information, such as between-cluster information, is seldom considered. In this study, a novel clustering technique called enhanced soft subspace clustering (ESSC) is proposed that employs both within-cluster and between-cluster information. First, a new optimization objective function is developed by integrating the within-cluster compactness and the between-cluster separation in the subspace. Based on this objective function, the corresponding update rules for clustering are derived, followed by the development of the novel ESSC algorithm. The properties of this algorithm are investigated, and its performance is evaluated experimentally on synthetic and real datasets, including synthetic high-dimensional datasets, UCI benchmark datasets, high-dimensional cancer gene expression datasets, and texture image datasets. The experimental studies demonstrate that the proposed ESSC algorithm is more accurate than most existing state-of-the-art soft subspace clustering algorithms.
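The abstract does not state the objective function itself, so the following is only an illustrative sketch of how a within-cluster compactness term and a between-cluster separation term might be combined in a weighted subspace. Everything in it is an assumption rather than the authors' formulation: the fuzzy membership matrix `U`, the fuzziness exponent `m`, the per-cluster feature weights `W`, the trade-off parameter `eta`, and the choice of the global data mean as the reference point for separation. Any regularization of the weights is left out.

```python
import numpy as np

def subspace_objective_terms(X, U, V, W, m=2.0):
    """Return (within, between) for a soft subspace partition.

    X: (n, d) data, U: (c, n) memberships, V: (c, d) cluster centers,
    W: (c, d) per-cluster feature weights. All names are hypothetical.
    """
    Um = U ** m                                    # (c, n)
    # Squared per-feature deviations of every point from every center.
    diff2 = (X[None, :, :] - V[:, None, :]) ** 2   # (c, n, d)
    # Weight each feature by the cluster's own subspace weights.
    d_within = np.einsum("cnd,cd->cn", diff2, W)   # (c, n)
    within = float(np.sum(Um * d_within))
    # Assumed separation measure: weighted distance of each center
    # from the global mean of the data.
    v0 = X.mean(axis=0)
    d_between = np.sum(W * (V - v0) ** 2, axis=1)  # (c,)
    between = float(np.sum(Um.sum(axis=1) * d_between))
    return within, between

def combined_objective(X, U, V, W, m=2.0, eta=0.1):
    """Compactness minus eta times separation (to be minimized)."""
    within, between = subspace_objective_terms(X, U, V, W, m)
    return within - eta * between
```

Under these assumptions, a smaller value corresponds to tighter clusters whose centers lie farther from the global mean along the features each cluster weights most heavily; `eta` controls how strongly separation is rewarded.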
