Data-driven Kernel-based Probabilistic SAX for Time Series Dimensionality Reduction

The ever-increasing volume and complexity of time series data across application domains call for efficient dimensionality reduction to facilitate data mining tasks. Symbolic representations, among them symbolic aggregate approximation (SAX), have proven very effective at compacting the information content of time series while exploiting the wealth of search algorithms developed in the bioinformatics and text mining communities. However, typical SAX-based techniques rely on a Gaussian assumption for the underlying data statistics, which often degrades their performance in practical scenarios. To overcome this limitation, this work introduces a method that makes no assumption about the probability distribution of the time series. Specifically, a data-driven kernel density estimator is first applied to the data, followed by Lloyd-Max quantization to determine the optimal horizontal segmentation breakpoints. Experimental evaluation on distinct datasets shows that our method outperforms both the conventional SAX method and a modified SAX method in terms of reconstruction accuracy and tightness of the lower bound.
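The two steps named above (a kernel density estimate followed by Lloyd-Max quantization) can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the normal-reference bandwidth rule, the grid resolution, the iteration count, and all function names (`normal_reference_bandwidth`, `kde_on_grid`, `lloyd_max_breakpoints`, `symbolize`) are assumptions made for illustration, and the usual SAX preprocessing (normalization and piecewise aggregation) is omitted.

```python
import bisect
import math
import random


def normal_reference_bandwidth(samples):
    """Rule-of-thumb bandwidth 1.06 * std * n^(-1/5) (an assumed choice)."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def kde_on_grid(samples, bandwidth, grid_size=400):
    """Evaluate a Gaussian-kernel density estimate on an evenly spaced grid."""
    lo = min(samples) - 3.0 * bandwidth
    hi = max(samples) + 3.0 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]
    return grid, density


def lloyd_max_breakpoints(grid, density, alphabet_size, iterations=100):
    """Alternate the two Lloyd-Max conditions on the discretized density."""
    lo, hi = grid[0], grid[-1]
    # Start from evenly spaced reconstruction levels.
    levels = [lo + (k + 0.5) * (hi - lo) / alphabet_size
              for k in range(alphabet_size)]
    for _ in range(iterations):
        # Condition 1: each breakpoint is the midpoint of adjacent levels.
        breakpoints = [(levels[k] + levels[k + 1]) / 2.0
                       for k in range(alphabet_size - 1)]
        # Condition 2: each level is the density-weighted centroid of its cell.
        mass = [0.0] * alphabet_size
        moment = [0.0] * alphabet_size
        for x, p in zip(grid, density):
            k = bisect.bisect_right(breakpoints, x)
            mass[k] += p
            moment[k] += p * x
        levels = [moment[k] / mass[k] if mass[k] > 0.0 else levels[k]
                  for k in range(alphabet_size)]
    breakpoints = [(levels[k] + levels[k + 1]) / 2.0
                   for k in range(alphabet_size - 1)]
    return breakpoints, levels


def symbolize(value, breakpoints):
    """Map a value to a letter according to the cell it falls into."""
    return chr(ord("a") + bisect.bisect_right(breakpoints, value))


# Synthetic bimodal (clearly non-Gaussian) data, for illustration only.
rng = random.Random(0)
samples = [rng.gauss(-2.0, 0.5) for _ in range(150)]
samples += [rng.gauss(2.0, 0.5) for _ in range(150)]

bandwidth = normal_reference_bandwidth(samples)
grid, density = kde_on_grid(samples, bandwidth)
breakpoints, levels = lloyd_max_breakpoints(grid, density, alphabet_size=4)
word = "".join(symbolize(v, breakpoints) for v in samples[:8])
```

In this sketch the breakpoints follow the estimated density rather than the quantiles of a standard normal distribution, so for bimodal data the cells concentrate around the two modes.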
