Segregation-based subspace clustering for huge dimensional data

Clustering algorithms break down when the data points fall in huge-dimensional spaces. To tackle this problem, many subspace clustering methods were proposed to build up a subspace where data points cluster efficiently. The bottom-up approach is used widely to select a set of candidate features, and then to use a portion of this set to build up the hidden subspace step by step. The complexity depends exponentially or cubically on the number of the selected features. In this paper, we present SEGCLU, a SEGregation-based subspace CLUstering method which significantly reduces the size of the candidate features' set and has a cubic complexity. This algorithm was applied at noise-free data of DNA copy numbers of two groups of autistic and typically developing children to extract a potential bio-marker for autism. 85% of the individuals were classified correctly in a 13-dimensional subspace.

[1]  Ahmed H. Tewfik,et al.  DCN variations for autism diagnosis , 2010, 2010 4th International Symposium on Communications, Control and Signal Processing (ISCCSP).

[2]  Hans-Peter Kriegel,et al.  A generic framework for efficient subspace clustering of high-dimensional data , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[3]  Christian Böhm,et al.  Density connected clustering with local subspace preferences , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[4]  Yi Zhang,et al.  Entropy-based subspace clustering for mining numerical data , 1999, KDD '99.

[5]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[6]  Hans-Peter Kriegel,et al.  Density-Connected Subspace Clustering for High-Dimensional Data , 2004, SDM.