FastMap in dimensionality reduction: ensemble clustering of high dimensional data

In this paper, we propose an ensemble clustering method for high-dimensional data that uses FastMap projection (FP) to generate component datasets. Compared with subspace component data generation methods such as random sampling (RS), random projection (RP) and principal component analysis (PCA), FP better preserves the clustering structure of the original data in the component datasets, which significantly improves the performance of ensemble clustering. We present experimental results on six real-world high-dimensional datasets showing that component datasets generated with FP preserve the clustering structure of the original data better than those generated with RS, RP and PCA. Experimental results for 12 ensemble clustering methods, formed by combining the four subspace component data generation methods with three consensus functions, also show that the methods using FP outperform those using RS, RP and PCA. Ensemble clustering with FP also performs better than the k-means clustering algorithm.
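To make the projection step concrete, the following is a minimal sketch of a FastMap-style projection for Euclidean data, following the pivot-based coordinate formula of the original FastMap algorithm [18]. It is an illustration under stated assumptions, not the paper's implementation: the function name `fastmap`, its parameters, the use of a random seed to vary the pivot choice between component datasets, and the synthetic data are all hypothetical, since the abstract does not say how the component datasets are diversified.

```python
import numpy as np


def fastmap(X, k, seed=0):
    """Project the rows of X to k dimensions with a FastMap-style procedure.

    Hypothetical helper for illustration; assumes Euclidean distances.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = np.zeros((n, k))

    def residual_sq_dist(i, col):
        # Squared distance from object i to every object, after removing
        # the part already explained by the first `col` coordinates.
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        d2 -= np.sum((Y[:, :col] - Y[i, :col]) ** 2, axis=1)
        return np.maximum(d2, 0.0)  # guard against tiny negative round-off

    for col in range(k):
        # Pivot heuristic: start from a random object, take the object
        # farthest from it as pivot a, then the farthest from a as pivot b.
        start = int(rng.integers(n))
        a = int(np.argmax(residual_sq_dist(start, col)))
        da = residual_sq_dist(a, col)
        b = int(np.argmax(da))
        db = residual_sq_dist(b, col)
        dab_sq = da[b]
        if dab_sq <= 1e-12:
            break  # no residual distance left; remaining columns stay zero
        # Cosine-law projection onto the line through the two pivots.
        Y[:, col] = (da + dab_sq - db) / (2.0 * np.sqrt(dab_sq))
    return Y


# Synthetic two-cluster data in 50 dimensions (illustrative only).
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 1.0, size=(30, 50)),
    data_rng.normal(3.0, 1.0, size=(30, 50)),
])

# Assumption: vary the seed (and hence the pivots) to obtain several
# component datasets for the ensemble.
components = [fastmap(X, k=5, seed=s) for s in range(3)]
```

Each component dataset would then be clustered separately (for example with k-means) and the resulting partitions combined by a consensus function. For Euclidean input, every coordinate is an orthogonal projection, so distances in a component dataset never exceed the original distances.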
