Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series

Many algorithms for mining or indexing time series data do not operate directly on the raw data. Instead, they use alternative representations that include transforms, quantization, approximation, and multi-resolution abstractions. Choosing the best representation and abstraction level for a given task and dataset is arguably the most critical step in time series data mining. In this work, we investigate the problem of discovering the natural intrinsic representation model, dimensionality, and alphabet cardinality of a time series. The ability to discover these intrinsic features automatically has implications beyond selecting the best parameters for particular algorithms: characterizing data in this manner is useful in its own right, and it is an important subroutine in algorithms for classification, clustering, and outlier discovery. We frame the discovery of these intrinsic features in the Minimum Description Length (MDL) framework. Extensive empirical tests show that our method is simpler, more general, and more accurate than previous methods, and that it has the important advantage of being essentially parameter-free.
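The MDL framing described above can be illustrated with a minimal sketch for the cardinality case. It is not the authors' implementation. It assumes the series has already been scaled to integers in 0..255, uses equal-width bins as the candidate model, and scores each candidate with a simplified two-part cost: the entropy-coded bits for the quantized model plus the entropy-coded bits for the residual. The function names (`entropy_bits`, `description_length`, `quantize`, `intrinsic_cardinality`) and the candidate range are hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits per symbol, of a sequence of integers."""
    n = len(values)
    counts = Counter(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def description_length(values):
    """Approximate bits needed to entropy-code the sequence (length x entropy)."""
    return len(values) * entropy_bits(values)


def quantize(series, cardinality, lo=0, hi=255):
    """Snap each value to the centre of one of `cardinality` equal-width bins.

    Assumption: values are integers in [lo, hi]; equal-width binning is only
    one possible candidate model.
    """
    width = (hi - lo + 1) / cardinality
    model = []
    for v in series:
        b = min(int((v - lo) / width), cardinality - 1)
        model.append(math.floor(lo + (b + 0.5) * width))
    return model


def intrinsic_cardinality(series, candidates=range(2, 65)):
    """Return (cardinality, cost) minimising the simplified two-part cost."""
    best = None
    for c in candidates:
        model = quantize(series, c)
        residual = [t - h for t, h in zip(series, model)]
        # Two-part code: bits for the model plus bits for what it fails to explain.
        cost = description_length(model) + description_length(residual)
        if best is None or cost < best[1]:
            best = (c, cost)
    return best


if __name__ == "__main__":
    # A toy two-level series, already on the assumed 0..255 integer scale.
    toy = [10, 10, 200, 200, 10, 10, 200, 200] * 4
    print(intrinsic_cardinality(toy))
```

The same pattern would extend to dimensionality by replacing `quantize` with a piecewise approximation and searching over the number of segments, again under whatever model-cost assumptions are chosen.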
