Distributed data clustering over networks

Abstract: In this paper, we consider the problem of distributed unsupervised clustering, where the training data is partitioned over a set of agents that interact over a sparse but connected communication network. To solve this problem, we recast the well-known Expectation-Maximization method in a distributed setting, exploiting a recently proposed algorithmic framework for in-network non-convex optimization. The resulting algorithm, termed Expectation-Maximization Consensus, uses successive local convexifications to split the computation among agents and relies on dynamic consensus to diffuse information over the network in real time. Convergence to local solutions of the distributed clustering problem is then established. Experimental results on well-known datasets show that the proposed method performs better than other distributed Expectation-Maximization clustering approaches. It is also faster than a centralized Expectation-Maximization procedure while achieving comparable performance in terms of cluster validity indexes. These indexes also take good values in absolute terms, which supports the quality of the obtained clustering results; the results compare favorably with other methods in the literature.
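As a rough illustration of how consensus can diffuse information over a sparse network, the following minimal sketch runs plain average consensus on a ring of agents. It is not the Expectation-Maximization Consensus algorithm itself: the ring topology, the Metropolis weighting rule, the scalar local statistics, and the function names `metropolis_weights` and `consensus_average` are all assumptions made for the example, and the paper's dynamic consensus additionally tracks quantities that change across iterations.

```python
def metropolis_weights(neighbors):
    """Build a doubly stochastic weight matrix from an undirected adjacency list.

    Assumption: Metropolis weights are used here only because they are a
    simple, well-known choice; the paper may use a different rule.
    """
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # The self-weight makes each row sum to one.
        W[i][i] = 1.0 - sum(W[i])
    return W


def consensus_average(values, neighbors, iterations=200):
    """Each agent repeatedly mixes its value with those of its neighbors."""
    W = metropolis_weights(neighbors)
    x = list(values)
    n = len(x)
    for _ in range(iterations):
        x = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical ring of 5 agents; each holds one local statistic
# (for instance, a local contribution to a cluster mean).
neighbors = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
local_stats = [2.0, 4.0, 6.0, 8.0, 10.0]
estimates = consensus_average(local_stats, neighbors)
network_average = sum(local_stats) / len(local_stats)
```

In this sketch every agent communicates only with its two ring neighbors, yet all entries of `estimates` approach `network_average`, which is the kind of agreement a distributed Expectation-Maximization step would need on its local statistics.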
