Multi-probe random projection clustering to secure very large distributed datasets

This paper presents a solution to the approximate k-means clustering problem for very large distributed datasets. Distributed data models have gained popularity in recent years following efforts by commercial, academic, and government organizations to make data more widely accessible. Given the sheer volume of available data, in-memory single-core computation quickly becomes infeasible, and distributed multiprocessing is required. Our solution achieves clustering performance comparable to that of other popular clustering algorithms, with improved overall complexity growth, while remaining amenable to distributed processing frameworks such as MapReduce. Our solution also maintains certain guarantees regarding data privacy and de-anonymization.
