Summarizing Distributed Data Streams for Storage in Data Warehouses

Data warehouses are increasingly supplied with data produced by large numbers of distributed sensors in applications such as medicine, the military, road traffic, weather forecasting, and utilities such as electric power suppliers. Such data is widely distributed and produced continuously as data streams. The rate at which data is collected at each sensor node affects the communication resources, the bandwidth, and the computational load at the central server. In this paper, we propose a generic tool for summarizing distributed data streams in which the amount of data collected from each sensor adapts to the characteristics of the data. Experiments on real electric power consumption data are reported and show the efficiency of the proposed approach.
