Summarizing numeric spatial data streams by trend cluster discovery

Advances in pervasive computing and sensor technologies have led to the explosive growth and ubiquity of geophysical data streams. Managing these massive, unbounded streams of sensor data poses several challenges, including the real-time application of summarization techniques that allow georeferenced, timestamped data to be stored and queried on a server with limited memory. To address this problem, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window by window, and stores these summaries in a database. The summary of each window is a set of trend clusters: clusters of georeferenced data that vary according to a similar trend along the window's time horizon. Several compression techniques are also investigated to derive a compact but accurate representation of these trends for storage in the database, and a learning strategy is designed to choose the best trend compression technique automatically. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to obtain an effective aggregation scheme that drastically reduces the number of bytes transmitted across the network and thereby extends the network lifespan. This scheme is mapped onto the routing structure of a tree-based wireless sensor network (WSN) topology. Experiments on several data streams from real sensor networks assess the summarization capability, accuracy, and efficiency of the proposed scheme.
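
To make the window-by-window idea concrete, the following Python sketch segments a stream into fixed-size windows and groups sensors into trend clusters. It is an illustration under stated assumptions, not the SUMATRA algorithm itself: the function names (`split_windows`, `trend_clusters`), the parameters `radius` and `delta`, the pairwise linking rule, and the use of the per-timestamp median as the cluster trend are all hypothetical choices made for this example.

```python
from itertools import combinations
from math import dist
from statistics import median


def split_windows(stream, w):
    """Yield consecutive windows of w snapshots (a trailing partial window is dropped)."""
    for start in range(0, len(stream) - w + 1, w):
        yield stream[start:start + w]


def trend_clusters(window, positions, radius, delta):
    """Group sensors that are spatially close and behave similarly over the window.

    Assumption: two sensors are linked when they lie within `radius` of each
    other and their readings differ by at most `delta` at every snapshot.
    Clusters are the connected components of that link graph, so similarity
    is only guaranteed between directly linked sensors.
    """
    sensors = sorted(positions)
    series = {s: [snapshot[s] for snapshot in window] for s in sensors}
    parent = {s: s for s in sensors}  # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in combinations(sensors, 2):
        close = dist(positions[a], positions[b]) <= radius
        similar = all(abs(x - y) <= delta for x, y in zip(series[a], series[b]))
        if close and similar:
            parent[find(a)] = find(b)

    groups = {}
    for s in sensors:
        groups.setdefault(find(s), []).append(s)

    # One (members, trend) pair per cluster; the trend is the median per snapshot.
    return [
        (members, [median(series[m][t] for m in members) for t in range(len(window))])
        for members in groups.values()
    ]


# Hypothetical data: three sensors, four snapshots, windows of two snapshots.
positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (10.0, 0.0)}
stream = [
    {"a": 20.0, "b": 20.5, "c": 20.2},
    {"a": 21.0, "b": 21.5, "c": 25.0},
    {"a": 22.0, "b": 25.0, "c": 22.0},
    {"a": 23.0, "b": 26.0, "c": 23.0},
]
summaries = [
    trend_clusters(win, positions, radius=2.0, delta=1.0)
    for win in split_windows(stream, 2)
]
```

In this toy run, sensors `a` and `b` share a cluster in the first window because they are neighbors with similar readings, while `c` is too far away to join them; in the second window `b` diverges from `a`, so each sensor forms its own cluster. Storing one trend per cluster rather than one series per sensor is what yields the summary.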
