RadiusSketch: Massively Distributed Indexing of Time Series

Performing similarity queries on hundreds of millions of time series is a challenge that requires both efficient indexing techniques and parallelization. We propose a sketch/random projection-based approach that scales nearly linearly in parallel environments and provides high-quality answers. We illustrate the performance of our approach, called RadiusSketch, on real and synthetic datasets of up to 1 terabyte and 500 million time series. The sketch method, as we have implemented it, is superior in both quality and response time to the state-of-the-art approach, iSAX2+. Even in the sequential case, it improves recall and precision by a factor of two while giving shorter response times. In a parallel environment with 32 processors, on both real and synthetic data, our parallel approach improves index construction time by a factor of up to 100 and query answering time by a factor of up to 15. Finally, our data structure makes use of idle computing time to improve recall and precision further.
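To make the sketch/random projection idea concrete, the following is a minimal illustration and not the RadiusSketch implementation. It assumes a dense random matrix with entries of +1 or -1 (one common choice of random projection) and a linear scan over the stored sketches. The function names `make_projection`, `sketch`, `euclidean`, and `nearest_by_sketch` are hypothetical and chosen only for this example.

```python
import math
import random


def make_projection(series_len, sketch_len, seed=0):
    """Build a sketch_len x series_len matrix of random +1/-1 entries (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(series_len)]
            for _ in range(sketch_len)]


def sketch(series, projection):
    """Project a series onto each random vector, scaled by 1/sqrt(sketch_len).

    With this scaling, distances between sketches approximate the
    Euclidean distances between the original series in expectation.
    """
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(r * x for r, x in zip(row, series)) for row in projection]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_by_sketch(query, sketches, projection):
    """Return the index of the stored sketch closest to the query's sketch.

    A linear scan is used here for clarity; an index over the sketches
    would replace it in a real system.
    """
    q = sketch(query, projection)
    return min(range(len(sketches)), key=lambda i: euclidean(q, sketches[i]))
```

In this illustration each series is reduced from `series_len` values to `sketch_len` values before any comparison, so candidate matches can be found on the short sketches and then verified against the original series if exact distances are needed.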
