A Bit Level Representation for Time Series Data Mining with Shape Based Similarity

Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper, we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate that time series stored as bits can be compressed and manipulated very efficiently and that, under some assumptions, the discriminatory power of clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this allows us to form a tight lower-bounding metric for Euclidean and Dynamic Time Warping distance and hence to query by content efficiently. Clipped data can be used in conjunction with a host of algorithms and statistical tests that follow naturally from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than other popular representations. The usefulness of the proposed representation is demonstrated by the fact that, at the same compression ratio, the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transform for both clustering and query by content. The flexibility of the representation is shown by the fact that we can use a variable Run Length Encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform Kolmogorov-based clustering.
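The clipping transformation and the run-length view of a clipped series can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`clip`, `pack_bits`, `hamming`, `run_lengths`) are hypothetical, values exactly equal to the mean are assumed to map to 0, and the Hamming distance is used only as a simple stand-in for bit-level comparison, not as the lower-bounding metric the paper derives.

```python
from statistics import mean


def clip(series):
    """Map each value to 1 if it lies above the series mean, else 0.

    Assumption: values equal to the mean are mapped to 0.
    """
    mu = mean(series)
    return [1 if x > mu else 0 for x in series]


def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int (first bit is most significant)."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


def hamming(a_bits, b_bits):
    """Count the positions at which two equal-length clipped series differ."""
    if len(a_bits) != len(b_bits):
        raise ValueError("clipped series must have the same length")
    # XOR leaves a 1 wherever the two bit strings disagree.
    return bin(pack_bits(a_bits) ^ pack_bits(b_bits)).count("1")


def run_lengths(bits):
    """Run-length encode a bit list as (first_bit, [length of each run])."""
    if not bits:
        return None, []
    runs = [1]
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            runs[-1] += 1
        else:
            runs.append(1)
    return bits[0], runs


a = clip([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])   # mean is 2.5
b = clip([2.0, 4.0, 6.0, 1.0, 0.0, 2.0])   # mean is 2.5
print(a)               # [0, 1, 0, 1, 1, 0]
print(b)               # [0, 1, 1, 0, 0, 0]
print(hamming(a, b))   # 3
print(run_lengths(a))  # (0, [1, 1, 1, 2, 1])
```

Storing each series as a single packed integer keeps the comparison to one XOR and a bit count, which hints at why bit-level series are cheap to manipulate. The run lengths are the quantities a variable-length code would then compress.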
