Parallel boosted clustering

Abstract Scalability of clustering algorithms is a critical issue in real-world clustering applications. Data sampling and parallelization are the two common ways to address it. Despite their wide use in many clustering algorithms, both suffer from major drawbacks. Data sampling often leads to biased solutions, because a sample may fail to capture the distribution of the entire data set accurately. The performance of parallelization, on the other hand, depends heavily on the original clustering routines, which are not parallel in nature, so customizing each algorithm to run in parallel may hurt clustering performance. To alleviate these problems, we propose a general two-step framework for scalable clustering: the first step obtains the skeleton structure of the data, and the second step obtains the final clustering. Concretely, the data are first partitioned and distributed across a two-dimensional grid, and local clustering algorithms are then applied iteratively to the cells of the grid, each providing a set of intermediate core points. These core points represent the dense or central regions of the data; they can be centers, modes, or means for centroid-based, density-based, and probability-based clustering, respectively. Finally, the core points are used to obtain the final clustering. The proposed framework has several benefits: (1) local clustering on the partitioned cells is conducted in parallel and can thus yield high speed-up; (2) clustering on the representative core points can be more robust; (3) the framework can be easily applied to other basic clustering methods and thus offers a general scalable solution. Theoretical analysis is provided, and extensive experimental results demonstrate the effectiveness and efficiency of the proposed framework.
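The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes k-means as the local clustering routine, a random one-shot partition of the data across the grid cells, and a single pass over the cells (the iterative application on the grid described in the abstract is omitted). All function names and parameters are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans_centers(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centers. Requires len(points) >= k."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the previous center if a group happens to be empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def partition_on_grid(points, rows, cols, seed=0):
    """Assumption: shuffle the data and deal it out evenly to rows * cols cells."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_cells = rows * cols
    return [shuffled[i::n_cells] for i in range(n_cells)]


def two_step_clustering(points, k, rows=2, cols=2, seed=0):
    """Step 1: local clustering per cell yields core points.
    Step 2: clustering the pooled core points yields the final centers."""
    cells = partition_on_grid(points, rows, cols, seed)
    # Cells are independent, so they can be processed concurrently.
    # Threads only illustrate the structure here; real speed-up would
    # need processes or separate machines.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda cell: kmeans_centers(cell, k, seed=seed), cells))
    core_points = [c for centers in local for c in centers]
    return kmeans_centers(core_points, k, seed=seed), core_points
```

Calling `two_step_clustering(data, k=2)` on a list of 2-D tuples returns the final centers together with the pooled core points. Since the second step runs only on `rows * cols * k` core points rather than on the full data set, its cost is small compared with the local step.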
