An Efficient Aggregation Method for the Symbolic Representation of Temporal Data

Symbolic representations are a useful tool for the dimensionality reduction of temporal data, allowing time series to be stored efficiently and queried for information. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form the method struggles to process very large time series. Here we present a new variant of the ABBA method, called fABBA, which uses a new aggregation approach tailored to the piecewise representation of time series. Replacing the k-means clustering used in ABBA with a sorting-based aggregation technique avoids repeated sum-of-squares error computations and thereby significantly reduces the computational complexity. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests we demonstrate that the new method runs considerably faster than ABBA while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
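To make the idea of a sorting-based aggregation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each piece of the piecewise representation is a point in the plane (for example a length and an increment), that the points are sorted by Euclidean norm, and that a hypothetical tolerance `alpha` decides group membership; the function name `aggregate_sorted` is likewise chosen only for illustration. Each unassigned point in sorted order becomes a starting point and collects the unassigned points within distance `alpha` of it. Because the points are sorted by norm, the inner scan can stop as soon as the norm gap exceeds `alpha`, so no sum-of-squares error is recomputed and the number of groups is a result rather than an input.

```python
import numpy as np

def aggregate_sorted(points, alpha):
    """Greedy sorting-based aggregation (illustrative sketch only).

    points : array of shape (n, d), e.g. one row per piece.
    alpha  : hypothetical grouping tolerance (an assumption of this sketch).
    Returns (labels, starts): a group label per point and the indices
    of the starting points that define the groups.
    """
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=1)
    order = np.argsort(norms, kind="stable")  # sort once, by norm
    labels = np.full(len(points), -1, dtype=int)
    starts = []
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # already absorbed by an earlier starting point
        label = len(starts)
        starts.append(int(i))
        labels[i] = label
        for j in order[pos + 1:]:
            # Reverse triangle inequality: once the norm gap exceeds
            # alpha, no later point can lie within alpha of point i.
            if norms[j] - norms[i] > alpha:
                break
            if labels[j] == -1 and np.linalg.norm(points[j] - points[i]) <= alpha:
                labels[j] = label
    return labels, starts

# Five toy pieces forming two well-separated groups.
pieces = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.05]]
labels, starts = aggregate_sorted(pieces, alpha=0.5)
print(labels.tolist(), starts)  # [0, 0, 1, 1, 0] [0, 2]
```

In this sketch the tolerance takes the place of the cluster count that k-means requires: a smaller `alpha` yields more groups, and hence more symbols, while a larger one yields fewer.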
