DUST: a generalized notion of similarity between uncertain time series

Large-scale sensor deployments and the growing use of privacy-preserving transformations have led to increasing interest in mining uncertain time series data. Traditional distance measures such as Euclidean distance or dynamic time warping are not always effective for analyzing such data. Several measures have recently been proposed to account for uncertainty in time series data. However, we show in this paper that their applicability is limited. Specifically, these approaches do not provide an intuitive way to compare two uncertain time series and do not easily accommodate multiple error functions. In this paper, we first provide a theoretical framework that generalizes the notion of similarity between uncertain time series. Second, we propose DUST, a novel distance measure that accommodates uncertainty and degenerates to the Euclidean distance when the distance is large compared to the error. We provide an extensive experimental validation of our approach for the following applications: classification, top-k motif search, and top-k nearest-neighbor queries.
