Capturing Data Uncertainty in High-Volume Stream Processing

We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables, such as many types of sensor data. To provide an end-to-end solution, our system employs probabilistic modeling and inference to generate uncertainty descriptions for raw data, and then a suite of statistical techniques to capture changes in uncertainty as data propagates through query operators. To cope with high-volume streams, we explore advanced approximation techniques for both space and time efficiency. We are currently working with a group of scientists to evaluate our system using traces collected from real-world applications for hazardous weather monitoring and for object tracking and monitoring.
