Distributed data classification in sensor networks

Low-overhead analysis of large distributed data sets is necessary for current data centers and for future sensor networks. In such systems, each node holds some data value, e.g., a local sensor reading, and a concise picture of the global system state needs to be obtained. In resource-constrained environments like sensor networks, this needs to be done without collecting all the data at any single location, i.e., in a distributed manner. To this end, we define the distributed classification problem, in which numerous interconnected nodes compute a classification of their data, i.e., partition these values into multiple collections and describe each collection concisely. We present a generic algorithm that solves the distributed classification problem and may be implemented in various topologies, using different classification types. For example, the generic algorithm can be instantiated to classify values according to distance, like the well-known k-means classification algorithm. However, the distance criterion alone is often not sufficient to provide good classification results. We therefore present an instantiation of the generic algorithm that describes the values as a Gaussian Mixture (a set of weighted normal distributions) and uses machine learning tools for classification decisions. Simulations show the robustness and speed of this algorithm. We prove that any implementation of the generic algorithm converges, for any connected topology, classification criterion, and collection representation, in fully asynchronous settings.
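
To make the idea of describing a collection concisely more concrete, the following is a minimal one-dimensional sketch, not the algorithm from the paper. It assumes each collection is summarized by a weight, a mean, and a variance, that two summaries are combined by standard moment matching, and that a node holding more than k summaries repeatedly merges the pair with the closest means. The names `Summary`, `merge`, and `reduce_to_k` are hypothetical, and the closest-means rule is only a stand-in for whatever classification criterion an instantiation would use.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical concise description of one collection (1-D case)."""
    weight: float
    mean: float
    var: float


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries by moment matching.

    The result preserves the total weight, the overall mean, and the
    overall variance of the two weighted components.
    """
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    var = (
        a.weight * (a.var + (a.mean - mean) ** 2)
        + b.weight * (b.var + (b.mean - mean) ** 2)
    ) / w
    return Summary(w, mean, var)


def reduce_to_k(summaries, k):
    """Merge the closest pair (by mean) until at most k summaries remain.

    Assumption: distance between means is the merge criterion; this is
    an illustrative choice, not the criterion used in the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    items = list(summaries)
    while len(items) > k:
        pairs = (
            (i, j)
            for i in range(len(items))
            for j in range(i + 1, len(items))
        )
        i, j = min(pairs, key=lambda p: abs(items[p[0]].mean - items[p[1]].mean))
        merged = merge(items[i], items[j])
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)
    return items
```

In a distributed setting, a node receiving summaries from a neighbor could append them to its own list and call `reduce_to_k`; because `merge` preserves total weight, no data mass is lost or double-counted by the reduction step itself.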
