Clustering distributed sensor data streams using local processing and reduced communication

Applications today produce unbounded streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. The usual techniques forward all the data to a central server and process it there as a single multivariate stream. In this paper, we propose DGClust, a new distributed algorithm that reduces both the dimensionality and the communication burden by letting each local sensor keep an online discretization of its own data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server of its new state. This way, at each point in time, the central site holds the global multivariate state of the entire network. To avoid monitoring all possible global states, whose number is exponential in the number of sensors, the central site keeps a small list of counters for the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss with respect to the real centroids, communication avoided, and processing reduction. The experimental work on synthetic data supports our proposal, showing robustness to a high number of sensors, and the application to real data from physiological sensors exhibits the same advantages.
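The local-state and frequent-state ideas described above can be illustrated with a minimal sketch. It is not the DGClust implementation: the equal-width grid over a known value range, the once-per-time-step counting, the Space-Saving-style eviction rule, and all names (`LocalSite`, `CentralSite`, `run`) are assumptions made for illustration. The final partitional clustering step is left out.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class LocalSite:
    """One sensor with an equal-width grid over an assumed, known value range."""

    def __init__(self, low: float, high: float, cells: int) -> None:
        self.low, self.high, self.cells = low, high, cells
        self.width = (high - low) / cells
        self.state: Optional[int] = None  # last triggered cell, None before any data

    def update(self, value: float) -> Optional[int]:
        # Constant-time mapping of a reading to a cell, clamped to the grid.
        cell = max(0, min(self.cells - 1, int((value - self.low) / self.width)))
        if cell == self.state:
            return None  # state unchanged: nothing is sent to the central site
        self.state = cell
        return cell

    def center(self, cell: int) -> float:
        # Central point of a cell, usable later as a clustering input.
        return self.low + (cell + 0.5) * self.width


class CentralSite:
    """Tracks the global state and a bounded set of frequent-state counters."""

    def __init__(self, n_sites: int, max_counters: int) -> None:
        self.global_state: List[Optional[int]] = [None] * n_sites
        self.max_counters = max_counters
        self.counts: Dict[Tuple[Optional[int], ...], int] = {}
        self.messages = 0  # number of notifications received

    def notify(self, site_id: int, cell: int) -> None:
        self.global_state[site_id] = cell
        self.messages += 1

    def tick(self) -> None:
        # Assumption: the current global state is counted once per time step.
        if any(cell is None for cell in self.global_state):
            return
        state = tuple(self.global_state)
        if state in self.counts:
            self.counts[state] += 1
        elif len(self.counts) < self.max_counters:
            self.counts[state] = 1
        else:
            # Space-Saving-style eviction (an assumption): replace the least
            # frequent state and inherit its count plus one.
            victim = min(self.counts, key=lambda s: self.counts[s])
            self.counts[state] = self.counts.pop(victim) + 1

    def frequent_states(self, k: int) -> List[Tuple[Optional[int], ...]]:
        return sorted(self.counts, key=lambda s: self.counts[s], reverse=True)[:k]


def run(readings: Sequence[Sequence[float]],
        sites: List[LocalSite], central: CentralSite) -> None:
    # Each row holds one reading per sensor for a single time step.
    for row in readings:
        for site_id, value in enumerate(row):
            cell = sites[site_id].update(value)
            if cell is not None:
                central.notify(site_id, cell)
        central.tick()
```

In this sketch a sensor only transmits when its reading moves to a different cell, so the number of messages depends on how often local states change rather than on the number of readings. The central points returned by `center` for each frequent state would be the inputs to the clustering step.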
