Optimizing time series discretization for knowledge discovery

Knowledge discovery in time series usually requires symbolic time series. Many discretization methods that convert numeric time series to symbolic ones ignore the temporal order of values. This often yields symbols that do not correspond to states of the process generating the time series and that cannot be interpreted meaningfully. We propose Persist, a new method for meaningful unsupervised discretization of numeric time series. The algorithm is based on the Kullback-Leibler divergence between the marginal and the self-transition probability distributions of the discretization symbols. Its performance is evaluated on both artificial and real-life data in comparison to the most common discretization methods. Persist achieves significantly higher accuracy than existing static methods and is robust against noise. It also outperforms Hidden Markov Models in all but very simple cases.
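The abstract does not spell out the scoring formula, so the following is only a minimal sketch of the idea of comparing marginal and self-transition probabilities with the Kullback-Leibler divergence. It assumes that each symbol is scored by a symmetrized divergence between two Bernoulli distributions (staying in the symbol versus its overall frequency), signed by whether self-transitions are more or less likely than chance, and averaged over symbols. The function names `bernoulli_sym_kl` and `persistence_score` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def bernoulli_sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between Bernoulli(p) and Bernoulli(q).

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0);
    this clipping is an assumption of the sketch.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    kl_pq = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    kl_qp = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
    return 0.5 * (kl_pq + kl_qp)


def persistence_score(symbols):
    """Illustrative persistence score of a symbolic sequence.

    For each symbol s, compare the empirical self-transition
    probability P(s -> s) with the marginal probability P(s).
    Positive values mean symbols tend to persist over time.
    """
    n = len(symbols)
    if n < 2:
        raise ValueError("need at least two symbols")
    marginal = Counter(symbols)
    # Number of transitions that start in each symbol.
    starts = Counter(symbols[:-1])
    # Number of transitions that stay in the same symbol.
    stays = Counter(a for a, b in zip(symbols, symbols[1:]) if a == b)

    total = 0.0
    for s, count in marginal.items():
        p = count / n
        a = stays[s] / starts[s] if starts[s] else 0.0
        sign = (a > p) - (a < p)
        total += sign * bernoulli_sym_kl(a, p)
    return total / len(marginal)


persistent = [0] * 50 + [1] * 50   # long runs of the same symbol
alternating = [0, 1] * 50          # symbol changes at every step
score_persistent = persistence_score(persistent)
score_alternating = persistence_score(alternating)
```

Both sequences have identical marginal distributions, so a method that ignores temporal order cannot tell them apart. Under the assumptions above, the sequence with long runs receives a positive score and the alternating one a negative score. A discretization procedure could use such a score to rank candidate bin boundaries, although how Persist actually searches for them is not described in the abstract.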
