Generating data series query workloads

Data series (including time series) have attracted considerable interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed to solve this problem. Up to this point, very little attention has been paid to properly evaluating such index structures, with most previous works relying solely on randomly selected data series as queries. In this work, we show that random workloads are inherently unsuitable for the task at hand, and we argue that query workloads need to be generated carefully. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that can effectively evaluate and compare data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. This is the first paper to introduce a method for quantifying the hardness of data series queries and for generating queries of predefined hardness.
