SAX-EFG: an evolutionary feature generation framework for time series classification

A variety of real-world applications fit the broad definition of time series classification. Traditional machine learning approaches, such as treating each time series as a high-dimensional vector, run into the well-known "curse of dimensionality" problem. Recently, the field has seen success with preprocessing steps that discretize the time series using Symbolic Aggregate ApproXimation (SAX) and use recurring subsequences ("motifs") as features. In this paper we explore a feature construction algorithm based on genetic programming that uses SAX-generated motifs as building blocks for constructing more complex features. The results show that the constructed features improve classification accuracy in a statistically significant manner for many applications.
