Distributed data clustering in sensor networks

Low-overhead analysis of large distributed data sets is necessary for current data centers and for future sensor networks. In such systems, each node holds some data value, e.g., a local sensor reading, and a concise picture of the global system state needs to be obtained. In resource-constrained environments like sensor networks, this must be done in a distributed manner, i.e., without collecting all the data at any single location. To this end, we address the distributed clustering problem, in which numerous interconnected nodes compute a clustering of their data: they partition the values into multiple clusters and describe each cluster concisely. We present a generic algorithm that solves the distributed clustering problem and may be implemented in various topologies, using different clustering types. For example, the generic algorithm can be instantiated to cluster values according to distance, targeting the same problem as the well-known k-means clustering algorithm. However, the distance criterion alone is often not sufficient to provide good clustering results. We therefore present an instantiation of the generic algorithm that describes the values as a Gaussian Mixture (a set of weighted normal distributions) and uses machine learning tools for clustering decisions. Simulations show the robustness, speed, and scalability of this algorithm. We prove that any implementation of the generic algorithm converges in fully asynchronous settings, for any connected topology, clustering criterion, and cluster representation.
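
To make the idea of a concise cluster description concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes one-dimensional values, represents each cluster as a weighted normal distribution, and merges two clusters by preserving their total weight, mean, and variance (moment matching). The names `Summary`, `merge`, and `reduce_to_k`, and the rule of always merging the pair with the closest means, are illustrative assumptions; the abstract does not specify how clusters are represented or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical cluster description: a weighted 1-D normal distribution."""
    weight: float
    mean: float
    var: float


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries so that total weight, mean, and variance are preserved."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    # Weighted average of second moments E[x^2] = var + mean^2.
    second = (a.weight * (a.var + a.mean ** 2)
              + b.weight * (b.var + b.mean ** 2)) / w
    return Summary(w, mean, second - mean ** 2)


def reduce_to_k(summaries, k):
    """Assumed clustering rule: merge the two closest means until at most k remain."""
    clusters = list(summaries)
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda p: abs(clusters[p[0]].mean - clusters[p[1]].mean))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Five raw readings, each starting as its own zero-variance cluster of weight 1.
readings = [Summary(1.0, x, 0.0) for x in (1.0, 1.2, 0.8, 10.0, 10.4)]
result = reduce_to_k(readings, k=2)
```

In a distributed setting, a node could apply a reduction like `reduce_to_k` to the union of its own summaries and those received from a neighbor, so that it never stores more than k cluster descriptions. Because `merge` conserves weight, the total weight across all clusters stays equal to the number of original readings.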
