k-Shape: Efficient and Accurate Clustering of Time Series

The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but also as a preprocessing step or subroutine for other techniques. In this paper, we present k-Shape, a novel algorithm for time-series clustering. k-Shape relies on a scalable iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. To demonstrate the robustness of k-Shape, we perform an extensive experimental evaluation of our approach against partitional, hierarchical, and spectral clustering methods, with combinations of the most competitive distance measures. k-Shape outperforms all scalable approaches in terms of accuracy. Furthermore, k-Shape also outperforms all non-scalable (and hence impractical) combinations, with one exception that achieves similar accuracy results. However, unlike k-Shape, this combination requires tuning of its distance measure and is two orders of magnitude slower than k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and highly efficient clustering approach for time series with broad applications.

[1]  Ge Yu,et al.  Similarity Match Over High Speed Time-Series Streams , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[2]  Pierre Gançarski,et al.  A global averaging method for dynamic time warping, with applications to clustering , 2011, Pattern Recognit..

[3]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[4]  Kuniaki Uehara,et al.  Extraction of Primitive Motion and Discovery of Association Rules from Human Motion Data , 2002, Progress in Discovery Science.

[5]  Dimitrios Gunopulos,et al.  Indexing Multidimensional Time-Series , 2004, The VLDB Journal.

[6]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[7]  R. Stephenson A and V , 1962, The British journal of ophthalmology.

[8]  Dimitrios Gunopulos,et al.  Discovering similar multidimensional trajectories , 2002, Proceedings 18th International Conference on Data Engineering.

[9]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[10]  Eamonn J. Keogh,et al.  Time Series Classification under More Realistic Assumptions , 2013, SDM.

[11]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[12]  Nikos Mamoulis,et al.  Fast and Exact Warping of Time Series Using Adaptive Segmental Approximations , 2005, Machine Learning.

[13]  Pierre Gançarski,et al.  Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment , 2012, Theor. Comput. Sci..

[14]  Michalis Vazirgiannis,et al.  c ○ 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques , 2022 .

[15]  Eamonn J. Keogh,et al.  Time series shapelets: a new primitive for data mining , 2009, KDD.

[16]  Eamonn J. Keogh,et al.  Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping , 2012, KDD.

[17]  R. Mantegna Hierarchical structure in financial markets , 1998, cond-mat/9802256.

[18]  Eamonn Keogh Exact Indexing of Dynamic Time Warping , 2002, VLDB.

[19]  Aristides Gionis,et al.  Correlating financial time series with micro-blogging activity , 2012, WSDM '12.

[20]  Eamonn J. Keogh A decade of progress in indexing and mining large time series databases , 2006, VLDB.

[21]  Eamonn J. Keogh,et al.  Logical-shapelets: an expressive primitive for time series classification , 2011, KDD.

[22]  Eamonn J. Keogh,et al.  CID: an efficient complexity-invariant distance for time series , 2013, Data Mining and Knowledge Discovery.

[23]  Eamonn J. Keogh,et al.  Making Time-Series Classification More Accurate Using Learned Constraints , 2004, SDM.

[24]  Anthony K. H. Tung,et al.  SpADe: On Shape-based Pattern Detection in Streaming Time Series , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[25]  C. Faloutsos Eecient Similarity Search in Sequence Databases , 1993 .

[26]  Steve Goddard,et al.  Geospatial decision support for drought risk management , 2003, CACM.

[27]  Luis Gravano,et al.  k-Shape: Efficient and Accurate Clustering of Time Series , 2016, SGMD.

[28]  Eamonn J. Keogh,et al.  Experimental comparison of representation methods and distance measures for time series data , 2010, Data Mining and Knowledge Discovery.

[29]  Dit-Yan Yeung,et al.  Mixtures of ARMA models for model-based time series clustering , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[30]  Lei Chen,et al.  Robust and fast similarity search for moving object trajectories , 2005, SIGMOD '05.

[31]  Dimitrios Gunopulos,et al.  Embedding-based subsequence matching in time-series databases , 2011, TODS.

[32]  Vit Niennattrakul,et al.  Shape-Based Clustering for Time Series Data , 2012, PAKDD.

[33]  Dina Q. Goldin,et al.  On Similarity Queries for Time-Series Data: Constraint Specification and Implementation , 1995, CP.

[34]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[35]  Hui Ding,et al.  Querying and mining of time series data: experimental comparison of representations and distance measures , 2008, Proc. VLDB Endow..

[36]  Piotr Indyk,et al.  Mining the stock market (extended abstract): which measure is best? , 2000, KDD '00.

[37]  P. Boesiger,et al.  A new correlation‐based fuzzy logic clustering algorithm for FMRI , 1998, Magnetic resonance in medicine.

[38]  Shuai Wang,et al.  Mining of Moving Objects from Time-Series Images and its Application to Satellite Weather Imagery , 2004, Journal of Intelligent Information Systems.

[39]  Dimitrios Gunopulos,et al.  Iterative Incremental Clustering of Time Series , 2004, EDBT.

[41]  Pierre Hansen,et al.  NP-hardness of Euclidean sum-of-squares clustering , 2008, Machine Learning.

[42]  Francesco Masulli,et al.  A survey of kernel and spectral methods for clustering , 2008, Pattern Recognit..

[43]  T. Warren Liao,et al.  Clustering of time series data - a survey , 2005, Pattern Recognit..

[44]  Wendy R. Fox,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[45]  M. Friedman The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance , 1937 .

[46]  Eamonn J. Keogh,et al.  Clustering of time-series subsequences is meaningless: implications for previous and future research , 2004, Knowledge and Information Systems.

[47]  Clu-istos Foutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[48]  Christos Faloutsos,et al.  BRAID: stream mining through group lag correlations , 2005, SIGMOD '05.

[49]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.

[50]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[51]  C. Ratanamahatana,et al.  Shape averaging under Time Warping , 2009, 2009 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology.

[52]  Clement T. Yu,et al.  Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping , 2003, IEEE Trans. Knowl. Data Eng..

[53]  Gautam Das,et al.  The Move-Split-Merge Metric for Time Series , 2013, IEEE Transactions on Knowledge and Data Engineering.

[54]  Yannis Theodoridis,et al.  Index-based Most Similar Trajectory Search , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[55]  Yunhao Liu,et al.  Indexable PLA for Efficient Similarity Search , 2007, VLDB.

[56]  Jure Leskovec,et al.  Patterns of temporal variation in online media , 2011, WSDM '11.

[57]  Philip S. Yu,et al.  Detecting Leaders from Correlated Time Series , 2010, DASFAA.

[58]  Dragomir Anguelov,et al.  Mining The Stock Market : Which Measure Is Best ? , 2000 .

[59]  Philip S. Yu,et al.  MALM: a framework for mining sequence database at multiple abstraction levels , 1998, CIKM '98.

[60]  Tim Oates,et al.  Identifying distinctive subsequences in multivariate time series by clustering , 1999, KDD '99.

[61]  Meena Mahajan,et al.  The Planar k-means Problem is NP-hard I , 2009 .

[62]  Heikki Mannila,et al.  Rule Discovery from Time Series , 1998, KDD.

[63]  Lei Chen,et al.  On The Marriage of Lp-norms and Edit Distance , 2004, VLDB.

[64]  Gustavo E. A. P. A. Batista,et al.  An Empirical Comparison of Dissimilarity Measures for Time Series Classification , 2013, 2013 Brazilian Conference on Intelligent Systems.

[65]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.

[66]  Raymond T. Ng,et al.  Indexing spatio-temporal trajectories with Chebyshev polynomials , 2004, SIGMOD '04.

[67]  B. Jaumard,et al.  Cluster Analysis and Mathematical Programming , 2003 .

[68]  R. Coifman,et al.  Local feature extraction and its applications using a library of bases , 1994 .

[69]  Gareth J. Janacek,et al.  Clustering time series from ARMA models with clipped data , 2004, KDD.

[70]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[71]  Donald J. Berndt,et al.  Using Dynamic Time Warping to Find Patterns in Time Series , 1994, KDD Workshop.

[72]  Philip S. Yu,et al.  Local Correlation Tracking in Time Series , 2006, Sixth International Conference on Data Mining (ICDM'06).

[73]  Kurt Hornik,et al.  A Combination Scheme for Fuzzy Clustering , 2002, AFSS.

[74]  C. Faloutsos,et al.  cient Similarity Search In Sequence Databases , 2004 .

[75]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[76]  Tommi S. Jaakkola,et al.  A new approach to analyzing gene expression time series data , 2002, RECOMB '02.

[77]  Vladimir Pavlovic,et al.  Discovering clusters in motion time-series data , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[78]  J. Rice Mathematical Statistics and Data Analysis , 1988 .

[79]  Ada Wai-Chee Fu,et al.  Efficient time series matching by wavelets , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[80]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[81]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[82]  Eamonn J. Keogh,et al.  Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data , 2011, 2011 IEEE 11th International Conference on Data Mining.

[83]  L. Gupta,et al.  Nonlinear alignment and averaging for estimating the evoked potential , 1996, IEEE Transactions on Biomedical Engineering.

[84]  Hao Wang,et al.  Durable Queries over Historical Time Series , 2014, IEEE Transactions on Knowledge and Data Engineering.

[85]  Eamonn J. Keogh,et al.  Clustering Time Series Using Unsupervised-Shapelets , 2012, 2012 IEEE 12th International Conference on Data Mining.

[86]  Xiaozhe Wang,et al.  Characteristic-Based Clustering for Time Series Data , 2006, Data Mining and Knowledge Discovery.

[87]  Jignesh M. Patel,et al.  An efficient and accurate method for evaluating time series similarity , 2007, SIGMOD '07.

[88]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[89]  Tao Jiang,et al.  On the Complexity of Multiple Sequence Alignment , 1994, J. Comput. Biol..

[90]  Qiang Wang,et al.  A multiresolution symbolic representation of time series , 2005, 21st International Conference on Data Engineering (ICDE'05).

[91]  L. K. Hansen,et al.  On Clustering fMRI Time Series , 1999, NeuroImage.

[92]  Chengchun Hao Introduction to Harmonic Analysis , 2016 .

[93]  Konstantinos Kalpakis,et al.  Distance measures for effective clustering of ARIMA time-series , 2001, Proceedings 2001 IEEE International Conference on Data Mining.