Local-Density Subspace Distributed Clustering for High-Dimensional Data

Distributed clustering has emerged with the advent of the big-data era. However, most established distributed clustering methods address problems caused by the large volume of data rather than by its high dimensionality. Consequently, they suffer from the “curse” of dimensionality (e.g., poor performance and heavy network overhead) when clustering high-dimensional (HD) data. In this article, we propose a distributed algorithm, referred to as the Local Density Subspace Distributed Clustering (LDSDC) algorithm, to cluster large-scale HD data. It is motivated by the idea that a local dense region of an HD dataset is usually distributed in a low-dimensional (LD) subspace. LDSDC follows a local-global-local processing structure: at each sub-site, local dense regions (atom clusters) are grouped and then fitted with a subspace Gaussian model (SGM), which is flexible and scalable with respect to data dimension; atom clusters are then merged at every sub-site according to the merging result broadcast from the global site. Moreover, we propose a fast method for estimating the SGM parameters for HD data, together with a proof of its convergence. We evaluate LDSDC on both synthetic and real datasets and compare it with four state-of-the-art methods. The experimental results demonstrate that LDSDC yields the best overall performance.
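
The motivating idea, that a local dense region of an HD dataset tends to lie in an LD subspace, can be illustrated with a minimal sketch. This is not the SGM fitting procedure of LDSDC; it is a plain local PCA on synthetic data, and the function name `local_intrinsic_dim`, the neighbourhood size `k`, the variance threshold `var_ratio`, and the synthetic data sizes are all assumptions made for illustration.

```python
import numpy as np

def local_intrinsic_dim(X, query_idx, k=50, var_ratio=0.95):
    """Hypothetical helper: count the principal directions needed to
    explain `var_ratio` of the variance among the k nearest neighbours
    of X[query_idx] (the query point itself is included)."""
    # Squared Euclidean distances from the query point to every point.
    d2 = np.sum((X - X[query_idx]) ** 2, axis=1)
    nbrs = X[np.argsort(d2)[:k]]
    centered = nbrs - nbrs.mean(axis=0)
    # Squared singular values are proportional to the variances
    # along the principal directions of the neighbourhood.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    # First index whose cumulative ratio reaches the threshold, plus one.
    return int(np.searchsorted(cum, var_ratio) + 1)

# Synthetic data (assumed for illustration): n points in D = 100
# dimensions that lie near a d = 3 dimensional linear subspace.
rng = np.random.default_rng(0)
D, d, n = 100, 3, 500
basis = np.linalg.qr(rng.normal(size=(D, d)))[0]   # D x d, orthonormal columns
latent = rng.normal(size=(n, d))
X = latent @ basis.T + 0.001 * rng.normal(size=(n, D))

est = local_intrinsic_dim(X, query_idx=0, k=50)
print(f"ambient dimension: {D}, estimated local dimension: {est}")
```

On data of this kind the estimated local dimension stays at or below the subspace dimension `d`, far smaller than the ambient dimension `D`, which is the property a subspace model fitted per local dense region is meant to exploit.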
