Fast window correlations over uncooperative time series

Data arriving in time order (a data stream) arises in fields including physics, finance, medicine, and music, to name a few. Often the data comes from sensors (in physics and medicine, for example) whose data rates continue to improve dramatically as sensor technology improves. Further, the number of sensors is increasing, so correlating data between sensors becomes ever more critical in order to distill knowledge from the data. In many applications such as finance, recent correlations are of far more interest than long-term correlations, so correlation over sliding windows (windowed correlation) is the desired operation. Fast response is desirable in many applications (e.g., to aim a telescope at an activity of interest or to perform a stock trade). These three factors -- data size, windowed correlation, and fast response -- motivate this work.

Previous work [10, 14] showed how to compute Pearson correlation using Fast Fourier Transforms and wavelet transforms, but such techniques do not work for time series in which the energy is spread over many frequency components, so that the series resembles white noise. For such "uncooperative" time series, this paper shows how to combine several simple techniques -- sketches (random projections), convolution, structured random vectors, grid structures, and combinatorial design -- to achieve high-performance windowed Pearson correlation over a variety of data sets.
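As a rough illustration of the sketch (random projection) idea named above, the following Python sketch estimates the Pearson correlation of two windows from short random projections. It relies on the standard identity that, for windows normalized to zero mean and unit norm, correlation equals 1 - d^2/2, where d is their Euclidean distance. The function names, the choice of +1/-1 random vectors, and the sizes n and k are assumptions made for this example; it omits the convolution, structured random vectors, grid structures, and combinatorial design that the paper combines with sketches.

```python
import math
import random

def normalize(window):
    """Shift to zero mean and scale to unit Euclidean norm."""
    n = len(window)
    mean = sum(window) / n
    centered = [x - mean for x in window]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def pearson(x, y):
    """Exact Pearson correlation: inner product of the normalized windows."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))

def make_projection(k, n, rng):
    """k random vectors of length n with entries +1 or -1 (an assumed choice)."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(k)]

def sketch(window, projection):
    """Inner product of the normalized window with each random vector."""
    z = normalize(window)
    return [sum(r_i * z_i for r_i, z_i in zip(r, z)) for r in projection]

def sketch_correlation(sx, sy):
    """Estimate correlation from two sketches via corr = 1 - d^2 / 2."""
    k = len(sx)
    d2 = sum((a - b) ** 2 for a, b in zip(sx, sy)) / k  # estimates squared distance
    return 1.0 - d2 / 2.0

rng = random.Random(0)
n, k = 1024, 256  # window length and sketch size (illustrative values)
x = [rng.gauss(0, 1) for _ in range(n)]            # noise-like window
y = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in x]   # partially correlated window
projection = make_projection(k, n, rng)
exact = pearson(x, y)
approx = sketch_correlation(sketch(x, projection), sketch(y, projection))
print(f"exact={exact:.3f} sketch estimate={approx:.3f}")
```

The estimate does not depend on how the energy of the windows is distributed across frequencies, which is why a projection of this kind remains usable on noise-like data. Its accuracy improves as k grows, at the cost of larger sketches.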

[1]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[2]  Renée J. Miller,et al.  Similarity search over time-series data using wavelets , 2002, Proceedings 18th International Conference on Data Engineering.

[3]  Dimitris Achlioptas,et al.  Database-friendly random projections , 2001, PODS.

[4]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[5]  Dennis Shasha,et al.  High Performance Discovery in Time Series , 2004, Monographs in Computer Science.

[6]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[7]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[8]  Pierre Giot,et al.  Market Models: A Guide to Financial Data Analysis , 2003 .

[9]  Alberto O. Mendelzon,et al.  Similarity-based queries for time series data , 1997, SIGMOD '97.

[10]  Sudipto Guha,et al.  Dynamic multidimensional histograms , 2002, SIGMOD '02.

[11]  Eamonn Keogh.  Exact Indexing of Dynamic Time Warping , 2002, VLDB.

[12]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[13]  Dimitrios Gunopulos,et al.  Indexing multi-dimensional time-series with support for multiple distance measures , 2003, KDD '03.

[14]  Petros Drineas,et al.  An Experimental Evaluation of a Monte-Carlo Algorithm for Singular Value Decomposition , 2001, Panhellenic Conference on Informatics.

[15]  Eamonn J. Keogh,et al.  Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases , 2001, Knowledge and Information Systems.

[16]  Piotr Indyk,et al.  Fast mining of massive tabular data via approximate distance computations , 2002, Proceedings 18th International Conference on Data Engineering.

[17]  Ada Wai-Chee Fu,et al.  Efficient time series matching by wavelets , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[18]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[19]  Patrick S. Huggins,et al.  A Randomized Singular Value Decomposition Algorithm for Image Processing Applications , 2004 .

[20]  Philip S. Yu,et al.  HierarchyScan: a hierarchical similarity search algorithm for databases of long sequences , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[21]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.

[22]  Piotr Indyk,et al.  Identifying Representative Trends in Massive Time Series Data Sets Using Sketches , 2000, VLDB.

[23]  Wang Rui.  Similarity search over time series data using DCT , 2007 .

[24]  Piotr Indyk,et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation , 2006, JACM.

[25]  Dennis Shasha,et al.  High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science) , 2004 .

[26]  Eamonn J. Keogh,et al.  Locally adaptive dimensionality reduction for indexing large time series databases , 2001, SIGMOD '01.

[27]  Dimitrios Gunopulos,et al.  Online amnesic approximation of streaming time series , 2004, Proceedings. 20th International Conference on Data Engineering.

[28]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[29]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[30]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[31]  R Agrawal,et al.  Fast mining of massive tabular data via approximate distance computations , 2002 .

[32]  Eamonn J. Keogh,et al.  UCR Time Series Data Mining Archive , 1983 .

[33]  Divyakant Agrawal,et al.  A comparison of DFT and DWT based similarity search in time-series databases , 2000, CIKM '00.

[34]  Bruce G. Lindsay,et al.  Random sampling techniques for space efficient online computation of order statistics of large datasets , 1999, SIGMOD '99.

[35]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[36]  D.M. Cohen,et al.  The Combinatorial Design Approach to Automatic Test Generation , 1996, IEEE Softw..

[37]  Christos Faloutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[38]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.