AFFINITY: Efficiently querying statistical measures on time-series data

Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data [1]-[6]. This primitive is gaining importance as time-series databases grow in number and size. In this paper, we introduce a framework for efficient computation of statistical measures that exploits the concept of affine relationships. Affine relationships can be used to infer statistical measures for a time series from other, related time series instead of computing them directly, thereby reducing the overall computational cost significantly. The resulting methods exhibit at least one order of magnitude improvement over the best known methods. To the best of our knowledge, this is the first work that presents a unified approach for computing and querying several statistical measures at once. Our approach exploits affine relationships using three key components. First, the AFCLST algorithm clusters the time-series data so that high-quality affine relationships can be found easily. Second, the SYMEX algorithm uses the clustered time series to efficiently compute the desired affine relationships. Third, the SCAPE index structure indexes the affine relationships and thereby produces a many-fold improvement in the performance of processing several statistical queries. Finally, we establish the effectiveness of our approach through a comprehensive experimental evaluation on real datasets.
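The idea of inferring a measure through an affine relationship, rather than recomputing it, can be illustrated with a minimal sketch. It assumes the simplest possible case, a single series y that is an exact scalar affine image of another series x (y = a*x + b), and relies only on the standard identities mean(y) = a*mean(x) + b and var(y) = a^2*var(x). The function names and sample values below are hypothetical; this is not the AFCLST, SYMEX, or SCAPE method, which the abstract does not specify in detail.

```python
import statistics


def affine_transform(series, a, b):
    """Return the series a*x + b (hypothetical helper for this sketch)."""
    return [a * x + b for x in series]


def infer_mean(mean_x, a, b):
    """Mean of a*x + b, obtained from the mean of x alone."""
    return a * mean_x + b


def infer_variance(var_x, a):
    """Variance of a*x + b, obtained from the variance of x alone."""
    return a * a * var_x


# Made-up sample data and coefficients, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
a, b = 2.5, -3.0
y = affine_transform(x, a, b)

# Measures of x are computed once by a pass over the data.
mean_x = statistics.fmean(x)
var_x = statistics.pvariance(x)

# Measures of y are inferred in constant time from those of x.
inferred_mean_y = infer_mean(mean_x, a, b)
inferred_var_y = infer_variance(var_x, a)

# Direct computation over y, used here only to check the inference.
direct_mean_y = statistics.fmean(y)
direct_var_y = statistics.pvariance(y)
```

In this toy setting the inferred values agree with the directly computed ones up to floating-point error. With real data an affine relationship would generally hold only approximately, so the inferred measures would carry some error.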

[1] Eamonn J. Keogh, et al. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, 2001, Knowledge and Information Systems.

[2] Christopher Ré, et al. Event queries on correlated probabilistic streams, 2008, SIGMOD Conference.

[3] Christos Faloutsos, et al. Efficient Similarity Search In Sequence Databases, 1993, FODO.

[4] Intraday Correlation Patterns between the S&P 500 and Sector Indices, 2010.

[5] Alberto O. Mendelzon, et al. Similarity-based queries for time series data, 1997, SIGMOD '97.

[6] Christos Faloutsos, et al. Fast subsequence matching in time-series databases, 1994, SIGMOD '94.

[7] Suman Nath, et al. Managing Massive Time Series Streams with MultiScale Compressed Trickles, 2009, Proc. VLDB Endow.

[8] Amol Deshpande, et al. Indexing correlated probabilistic databases, 2009, SIGMOD Conference.

[9] Divesh Srivastava, et al. On computing correlated aggregates over continual data streams, 2001, SIGMOD '01.

[10] J. Hull. Options, Futures, and Other Derivatives, 1989.

[11] Dennis Shasha, et al. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, 2002, VLDB.

[12] Wilfred Ng, et al. Correlation search in graph databases, 2007, KDD '07.

[13] W. Sharpe. Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk, 1964.

[14] Richard Cole, et al. Fast window correlations over uncooperative time series, 2005, KDD '05.

[15] Jiang Wang, et al. Trading Volume and Serial Correlation in Stock Returns, 1992.

[16] Philip S. Yu, et al. HierarchyScan: a hierarchical similarity search algorithm for databases of long sequences, 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[17] C. R. Henson. Conclusion, 1969.

[18] Kyuseok Shim, et al. Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, 1995, VLDB.

[19] Alberto O. Mendelzon, et al. Similarity-based queries, 1995, PODS '95.

[20] Jie Liu, et al. Fast approximate correlation for massive time-series data, 2010, SIGMOD Conference.

[21] Hui Xiong, et al. TAPER: a two-step approach for all-strong-pairs correlation query in large databases, 2006, IEEE Transactions on Knowledge and Data Engineering.

[22] Jie Liu, et al. GAMPS: compressing multi sensor data by grouping and amplitude scaling, 2009, SIGMOD Conference.

[23] Jimeng Sun, et al. Streaming Pattern Discovery in Multiple Time-Series, 2005, VLDB.

[24] Robert H. Halstead, et al. Matrix Computations, 2011, Encyclopedia of Parallel Computing.

[25] Xiang Lian, et al. Efficient Similarity Search over Future Stream Time Series, 2008, IEEE Transactions on Knowledge and Data Engineering.

[26] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.