How to Screen a Data Stream - Quality-Driven Load Shedding in Sensor Data Streams

As most data stream sources exhibit bursty data rates, data stream management systems must repeatedly cope with load spikes that considerably exceed the average workload. To guarantee low-latency processing results, load has to be shed from the stream when data rates overstress system resources. Numerous load shedding strategies exist to delete excess data. However, the resulting data loss leads to incomplete and/or inaccurate results during the ongoing stream processing. In this paper, we present a novel quality-driven load shedding approach that screens the data stream to find and discard data items of low quality. In this way, the data quality of stream processing results is maximized under the adverse condition of data overload. After an introduction to data quality management in data streams, we define three data quality-driven load shedding algorithms, which minimize the approximation error of aggregations and maximize the completeness of join processing results, respectively. Finally, we demonstrate their superiority over existing load shedding techniques on real-life weather data.
