A Decentralized and Robust Approach to Estimating a Probabilistic Mixture Model for Structuring Distributed Data

Data-sharing services on the web host huge amounts of resources supplied and accessed by millions of users around the world. While the classical approach is central control over the data set, even when the data set itself is distributed, there is growing interest in decentralized solutions because of their good properties (in particular, privacy and scalability). In this paper, we explore a machine learning side of this line of work. We propose a novel technique for decentralized estimation of probabilistic mixture models, which are among the most versatile generative models for understanding data sets. More precisely, we demonstrate how to estimate a global mixture model from a set of local models. Our approach accommodates dynamic topology and data sources and is statistically robust, i.e. resilient to the presence of unreliable local models. Such outlier models may arise from local data that are outliers with respect to the global trend, or from poor mixture estimation. We report experiments on synthetic data and real geo-location data from Flickr.
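To make the idea of building a global mixture from local ones concrete, the following is a minimal sketch, not the estimator proposed in the paper. It assumes one-dimensional Gaussian components and simply pools the components of all local models, rescaling each local weight by that node's share of the total data. The names `Component`, `LocalModel`, `n_points`, `pool_local_models` and `mixture_mean` are hypothetical and introduced only for illustration. The sketch does not reduce the number of components and does nothing to detect unreliable local models, which is the robustness issue the abstract raises.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    weight: float  # mixing proportion within its own mixture
    mean: float
    var: float


@dataclass
class LocalModel:
    n_points: int  # assumption: each node knows its local sample size
    components: List[Component]


def pool_local_models(models: List[LocalModel]) -> List[Component]:
    """Concatenate all local components into one mixture.

    Each component weight is multiplied by the node's share of the
    total data, so the pooled weights sum to 1 whenever every local
    mixture has weights summing to 1.
    """
    total = sum(m.n_points for m in models)
    pooled = []
    for m in models:
        share = m.n_points / total
        for c in m.components:
            pooled.append(Component(share * c.weight, c.mean, c.var))
    return pooled


def mixture_mean(components: List[Component]) -> float:
    """Mean of the mixture: weighted sum of the component means."""
    return sum(c.weight * c.mean for c in components)


if __name__ == "__main__":
    # Two illustrative nodes with made-up parameters.
    models = [
        LocalModel(100, [Component(0.5, 0.0, 1.0), Component(0.5, 4.0, 1.0)]),
        LocalModel(300, [Component(1.0, 2.0, 0.5)]),
    ]
    for c in pool_local_models(models):
        print(c)
```

Pooling of this kind weights every local model purely by its size, so a single poorly estimated local mixture contributes in proportion to its data and can distort the global model. That sensitivity is what motivates a statistically robust aggregation step.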