Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, and anomaly detection. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, whose utility seems to be underappreciated and which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower-powered devices than are currently possible.
