Robust estimation of a global Gaussian mixture by decentralized aggregations of local models

Distributed data collections are increasingly common, owing to the emergence of cloud computing, spatially decentralized businesses, and the availability of various data-sharing web services. Extracting knowledge from such collections calls for new data mining methods that can operate in a decentralized architecture. In this paper, we explore the machine learning side of this research direction. We propose a novel technique for the decentralized estimation of probabilistic mixture models, which are among the most versatile generative models for understanding data sets. More precisely, we demonstrate how to estimate a global mixture model from a set of local models. Our approach accommodates dynamic network topologies and data sources, and it is statistically robust, i.e. resilient to the presence of unreliable local models. Such outlier models may arise either from local data that deviate from the global trend or from poor local mixture estimation. We report experiments on synthetic data and on real geo-location data from Flickr.
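As a rough illustration of what "estimating a global mixture from a set of local models" can mean, the sketch below pools local one-dimensional Gaussian mixtures by concatenating their components, with each local model weighted by the share of data it was fitted on. This is only the elementary fact that a mixture of mixtures is itself a mixture; it is not the aggregation procedure of the paper and it does not include any robustness to outlier models. The names `LocalGMM`, `concatenate_mixtures` and `moment_match` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocalGMM:
    """Hypothetical container for a local 1-D Gaussian mixture."""
    n_points: int            # number of local observations behind the model
    weights: List[float]     # component weights, assumed to sum to 1
    means: List[float]
    variances: List[float]


Component = Tuple[float, float, float]  # (global weight, mean, variance)


def concatenate_mixtures(models: List[LocalGMM]) -> List[Component]:
    """Pool all local components into one global mixture.

    Each local model contributes in proportion to its sample count
    (an assumption of this sketch), so the global weights sum to 1.
    """
    total = sum(m.n_points for m in models)
    components: List[Component] = []
    for m in models:
        share = m.n_points / total
        for w, mu, var in zip(m.weights, m.means, m.variances):
            components.append((share * w, mu, var))
    return components


def moment_match(components: List[Component]) -> Tuple[float, float]:
    """Collapse components into one Gaussian with the same mean and variance."""
    w_sum = sum(w for w, _, _ in components)
    mean = sum(w * mu for w, mu, _ in components) / w_sum
    var = sum(w * (v + (mu - mean) ** 2) for w, mu, v in components) / w_sum
    return mean, var


if __name__ == "__main__":
    site_a = LocalGMM(100, [1.0], [0.0], [1.0])
    site_b = LocalGMM(300, [0.5, 0.5], [4.0, 8.0], [1.0, 1.0])
    pooled = concatenate_mixtures([site_a, site_b])
    print(pooled)
    print(moment_match(pooled))
```

The pooled mixture grows with the number of sites, so a practical scheme would follow the concatenation with a reduction step that merges similar components; `moment_match` shows the simplest such merge, collapsing a group of components into a single Gaussian.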
