catch22: CAnonical Time-series CHaracteristics

Capturing the dynamical properties of time series concisely as interpretable feature vectors can enable efficient clustering and classification for time-series applications across science and industry. An appropriate feature-based representation of time series for a given application can be selected through systematic comparison across a comprehensive time-series feature library, such as the one in the hctsa toolbox. However, this approach is computationally expensive and involves evaluating many similar features, which limits the widespread adoption of feature-based representations of time series for real-world applications. In this work, we introduce a method to infer small sets of time-series features that (i) exhibit strong classification performance across a given collection of time-series problems, and (ii) are minimally redundant. Applying our method to a set of 93 time-series classification datasets (containing over 147,000 time series) and using a filtered version of the hctsa feature library (4791 features), we introduce a set of 22 CAnonical Time-series CHaracteristics, catch22, tailored to the dynamics typically encountered in time-series data-mining tasks. This dimensionality reduction, from 4791 to 22, is associated with an approximately 1000-fold reduction in computation time and near-linear scaling with time-series length, while reducing classification accuracy by just 7% on average. catch22 captures a diverse and interpretable signature of time series in terms of their properties, including linear and non-linear autocorrelation, successive differences, value distributions and outliers, and fluctuation scaling properties. We provide an efficient implementation of catch22, accessible from many programming environments, that facilitates feature-based time-series analysis for scientific, industrial, financial and medical applications using a common language of interpretable time-series properties.
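The two selection criteria above (strong performance across tasks, minimal redundancy) can be illustrated with a small sketch. The abstract does not spell out the procedure, so the code below uses a simple greedy filter as a stand-in: rank features by mean accuracy across tasks, keep the top few, then drop any feature whose per-task accuracy profile is highly correlated with one already kept. The function names, feature names, accuracy values and threshold are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def select_features(accuracy, n_top, max_abs_corr):
    """Greedy stand-in for 'high-performing, minimally redundant' selection.

    accuracy: dict mapping feature name -> list of per-task accuracies
              (all lists in the same task order).
    """
    # Step 1: keep the n_top features with the highest mean accuracy.
    ranked = sorted(accuracy, key=lambda f: mean(accuracy[f]), reverse=True)[:n_top]
    # Step 2: walk down the ranking and skip features whose performance
    # profile closely tracks an already selected feature.
    selected = []
    for f in ranked:
        if all(abs(pearson(accuracy[f], accuracy[g])) <= max_abs_corr for g in selected):
            selected.append(f)
    return selected


# Hypothetical per-task accuracies for four made-up features on four tasks.
toy = {
    "acf_lag1": [0.90, 0.60, 0.80, 0.70],
    "acf_lag2": [0.88, 0.58, 0.79, 0.69],        # near-duplicate of acf_lag1
    "outlier_timing": [0.79, 0.79, 0.69, 0.69],  # different strengths across tasks
    "noise": [0.50, 0.52, 0.49, 0.51],           # weak everywhere
}

selected = select_features(toy, n_top=3, max_abs_corr=0.8)
print(selected)
```

In this toy example the weak feature is cut by the performance ranking and the near-duplicate is cut by the redundancy check, leaving two features that perform well on different tasks.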
