On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim: much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of “improvement” that would have been completely dwarfed by the variance observed by testing on many real-world datasets, or by the variance observed by changing minor (unstated) implementation details. To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contributions of more than two dozen papers and testing them on 50 real-world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
