Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

Clustering time series is a useful operation in its own right, and an important subroutine in many higher-level data mining analyses, including data editing for classifiers, summarization, and outlier detection. Although it has been noted that the general superiority of Dynamic Time Warping (DTW) over Euclidean distance for similarity search diminishes as datasets grow ever larger, we show that the same is not true for clustering. Clustering time series under DTW therefore remains a computationally challenging task. In this work, we address this lethargy in two ways. First, we propose a novel pruning strategy that exploits both upper and lower bounds to prune off a large fraction of the expensive distance calculations. This pruning strategy is admissible: it gives provably identical results to the brute-force algorithm, yet is at least an order of magnitude faster. Second, for datasets where even this level of speedup is inadequate, we show that a simple heuristic can order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering as an anytime algorithm. We demonstrate the utility of our ideas with both single and multidimensional case studies in the domains of astronomy, speech physiology, medicine, and entomology.
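
The abstract does not say which bounds are used, so the following is only a minimal sketch of how pruning with both an upper and a lower bound can remain admissible. It assumes equal-length series, a Sakoe-Chiba warping window `w`, LB_Keogh as the lower bound, and the Euclidean distance as the upper bound. The task shown (counting, for each series, how many others lie within a cutoff distance) and all function names are hypothetical choices for illustration, not the paper's algorithm.

```python
import math

def euclidean(a, b):
    # Upper bound on DTW: the diagonal alignment is one valid warping path.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lb_keogh(a, b, w):
    # Lower bound on band-constrained DTW: squared distance from each
    # point of `a` to the min/max envelope of `b` within the window.
    total, n = 0.0, len(b)
    for i, x in enumerate(a):
        window = b[max(0, i - w): min(n, i + w + 1)]
        hi, lo = max(window), min(window)
        if x > hi:
            total += (x - hi) ** 2
        elif x < lo:
            total += (x - lo) ** 2
    return math.sqrt(total)

def dtw(a, b, w):
    # Exact DTW restricted to |i - j| <= w, for equal-length series.
    n = len(a)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n])

def neighbor_counts(series, w, cutoff, prune=True):
    """Count, per series, the others with DTW < cutoff.

    Returns (counts, number_of_full_dtw_computations).
    With prune=True a pair is decided by the bounds whenever possible.
    """
    counts = [0] * len(series)
    dtw_calls = 0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            a, b = series[i], series[j]
            if prune and lb_keogh(a, b, w) >= cutoff:
                close = False          # DTW >= lower bound >= cutoff
            elif prune and euclidean(a, b) < cutoff:
                close = True           # DTW <= upper bound < cutoff
            else:
                dtw_calls += 1         # bounds inconclusive: compute DTW
                close = dtw(a, b, w) < cutoff
            if close:
                counts[i] += 1
                counts[j] += 1
    return counts, dtw_calls
```

Because each pruned pair is decided by an inequality that the exact DTW distance must also satisfy, `neighbor_counts(..., prune=True)` returns the same counts as `prune=False`, while typically calling `dtw` on fewer pairs. How much is saved depends on the data and on the cutoff.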
