Random direction divisive clustering

Projection methods for dimension reduction have enabled the discovery of otherwise unattainable structure in ultra-high-dimensional data. More recently, one such method, Random Projection, has been shown to yield high-quality data representations with minimal computational effort, even when the data dimension is in the hundreds of thousands or millions. Here, we couple this dimension reduction technique with data clustering algorithms designed specifically for high-dimensional data. First, we show that the theoretical properties of both components can be combined in a sound manner, which promises an effective clustering framework. Then, an experimental analysis on a series of simulated and real ultra-high-dimensional data sets shows that the resulting algorithms achieve high-quality data partitions orders of magnitude faster.
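The coupling described above can be illustrated with a minimal sketch, which is not the algorithm evaluated in this work. It assumes a dense Gaussian projection matrix and, as a stand-in for the divisive step, a single split along the first principal direction of the projected data (in the spirit of principal direction divisive partitioning). The function names, the synthetic two-cluster data, and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def make_two_clusters(n_per_cluster=50, d=5000, offset=2.0, seed=1):
    """Synthetic data (illustrative assumption): two Gaussian blobs in d dimensions."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_per_cluster, d))
    b = rng.standard_normal((n_per_cluster, d)) + offset
    X = np.vstack([a, b])
    y = np.repeat([False, True], n_per_cluster)
    return X, y


def random_projection(X, k, seed=0):
    """Project the rows of X onto k dimensions with a Gaussian random matrix.

    Entries are drawn from N(0, 1/k), so squared Euclidean distances are
    preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R


def bisect(Z):
    """One divisive step: split at the mean along the first principal direction."""
    Zc = Z - Z.mean(axis=0)
    # Rows of vt are the right singular vectors; vt[0] is the leading direction.
    _, _, vt = np.linalg.svd(Zc, full_matrices=False)
    return (Zc @ vt[0]) > 0


if __name__ == "__main__":
    X, y = make_two_clusters()
    Z = random_projection(X, k=200)
    pred = bisect(Z)
    # Cluster labels are arbitrary, so score both labelings and keep the better one.
    agreement = max(np.mean(pred == y), np.mean(pred != y))
    print("projected shape:", Z.shape)
    print("agreement with generating labels:", agreement)
```

Applying `bisect` recursively to each resulting part would yield a divisive hierarchy. The point of the sketch is only that the split is computed in the reduced space of dimension k rather than in the original space of dimension d.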
