A feature group weighting method for subspace clustering of high-dimensional data

This paper proposes a new method for weighting subspaces at the level of both feature groups and individual features when clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups based on their natural characteristics. Two types of weights are introduced into the clustering process to simultaneously identify the importance of feature groups and of individual features in each cluster. A new optimization model defines the clustering objective, and a new clustering algorithm, FG-k-means, is proposed to solve it. The algorithm extends k-means with two additional steps that automatically compute the two types of subspace weights. A new data generation method is also presented to produce high-dimensional data with clusters embedded in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data show that FG-k-means significantly outperformed four k-means-type algorithms, namely k-means, W-k-means, LAC and EWKM, in almost all experiments. The new algorithm is also robust to the noise and missing values that commonly occur in high-dimensional data.
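
The sketch below illustrates the kind of alternating procedure the abstract describes: a k-means loop extended with per-cluster weights on feature groups and on individual features. The entropy-regularized (soft-max) weight updates, the parameters lam and eta, and the function name fg_kmeans are illustrative assumptions in the spirit of EWKM-style weighting; the paper's own optimization model defines the actual update formulas.

```python
import numpy as np

def fg_kmeans(X, k, groups, lam=1.0, eta=1.0, n_iter=50, seed=0):
    """X: (n, m) data matrix; groups: list of index arrays partitioning the m features."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)  # cluster centres
    w = np.full((k, len(groups)), 1.0 / len(groups))                 # group weights, per cluster
    v = np.ones((k, m))                                              # feature weights, per cluster
    for idx in groups:
        v[:, idx] = 1.0 / len(idx)                                   # normalised within each group

    for _ in range(n_iter):
        # Step 1: assign each point to the cluster with the smallest doubly weighted distance.
        dist = np.zeros((n, k))
        for l in range(k):
            d = (X - centers[l]) ** 2                                # per-feature squared distances
            for t, idx in enumerate(groups):
                dist[:, l] += w[l, t] * (d[:, idx] * v[l, idx]).sum(axis=1)
        labels = dist.argmin(axis=1)

        # Step 2: update each cluster centre as the mean of its members.
        for l in range(k):
            members = X[labels == l]
            if len(members):
                centers[l] = members.mean(axis=0)

        # Steps 3-4: update feature weights and group weights for each cluster
        # using entropy-regularised soft-max updates (an assumed form).
        for l in range(k):
            members = X[labels == l]
            if not len(members):
                continue
            D = ((members - centers[l]) ** 2).sum(axis=0)            # per-feature dispersion
            for t, idx in enumerate(groups):
                e = np.exp(-(D[idx] - D[idx].min()) / eta)
                v[l, idx] = e / e.sum()                              # feature weights within group t
            Dg = np.array([(v[l, idx] * D[idx]).sum() for idx in groups])
            eg = np.exp(-(Dg - Dg.min()) / lam)
            w[l] = eg / eg.sum()                                     # group weights for cluster l
    return labels, centers, w, v

if __name__ == "__main__":
    # Three well-separated clusters; six features forming two natural groups of three.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 1.0, size=(50, 6)) for c in (0.0, 5.0, 10.0)])
    labels, centers, w, v = fg_kmeans(X, k=3, groups=[np.arange(3), np.arange(3, 6)])
    print(labels[:10], w.round(3))
```

As in EWKM-style schemes, smaller values of lam and eta concentrate the weights on the least dispersed groups and features within each cluster, while larger values keep the weights closer to uniform.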
