Modeling and Querying Data Series and Data Streams with Uncertainty

Many real applications consume data that is intrinsically uncertain and error-prone. An uncertain data series is a series whose point values are uncertain. An uncertain data stream is a data stream whose tuples are existentially uncertain, have uncertain values, or both. Typical sources of uncertainty in data series and data streams include sensor data, data synopses, privacy-preserving transformations and forecasting models. In this thesis, we focus on the following three problems: (1) the formulation and evaluation of similarity search queries in uncertain data series; (2) the evaluation of nearest neighbor search queries in uncertain data series; (3) the adaptation of sliding windows in uncertain data stream processing to accommodate existential and value uncertainty. We demonstrate experimentally that the correlation among neighboring timestamps in data series can be leveraged to increase the accuracy of the results. We further show that the "possible world" semantics can be used as the underlying uncertainty model to formulate nearest neighbor queries that can be evaluated efficiently. Finally, we discuss the relation between existential and value uncertainty in data stream applications, and experimentally validate our proposed uncertain sliding windows.
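
To make the "possible world" semantics concrete, the following is a minimal sketch and not the evaluation method developed in the thesis. It assumes that each uncertain series has a small discrete set of alternative realizations with known probabilities, that series are independent of one another, and that distance is Euclidean. The function name `nn_probabilities` and the toy data are hypothetical. It enumerates every possible world by brute force, which is exponential in the number of series and is shown only to illustrate the semantics.

```python
from itertools import product
from math import dist


def nn_probabilities(query, candidates):
    """Probability that each candidate is the nearest neighbor of `query`.

    `candidates` maps a name to a list of (series, probability) alternatives.
    Assumption for this sketch: candidates are mutually independent, so the
    probability of a world is the product of the chosen alternatives.
    """
    names = list(candidates)
    probs = {name: 0.0 for name in names}
    # One possible world = one alternative chosen for every candidate.
    for world in product(*(candidates[name] for name in names)):
        world_prob = 1.0
        for _, p in world:
            world_prob *= p
        distances = [dist(query, series) for series, _ in world]
        # Ties go to the first candidate with the minimal distance.
        winner = names[distances.index(min(distances))]
        probs[winner] += world_prob
    return probs


# Toy data (hypothetical): "a" has two equally likely realizations.
query = (0.0, 0.0)
candidates = {
    "a": [((1.0, 0.0), 0.5), ((3.0, 0.0), 0.5)],
    "b": [((2.0, 0.0), 1.0)],
}
result = nn_probabilities(query, candidates)
print(result)
```

In this toy setting, "a" is the nearest neighbor in the world where it takes its first realization and "b" is nearest in the other, so each ends up with probability 0.5. An efficient evaluation would avoid materializing the worlds explicitly.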
