Significant motifs in time series

Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem in applications that range from finance to health. Many algorithms have been proposed for finding motifs efficiently. Surprisingly, most of these proposals do not address how to evaluate the discovered motifs, which are typically assessed by human experts. This is infeasible even for moderately sized datasets, since the number of discovered motifs tends to be prohibitively large. Statistical significance tests are widely used in the data mining community to evaluate extracted patterns. In this work we present an approach to calculate the statistical significance of time series motifs. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif using Markov chain models. The p-value is then obtained by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution makes it possible to apply a powerful technique, statistical testing, in a time series setting. This provides researchers and practitioners with an important tool for automatically evaluating the degree of relevance of each extracted motif. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 5: 35–53, 2012
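As an illustration of the idea summarized above, the following minimal Python sketch estimates a motif's expected frequency under a Markov chain model and compares it with the observed frequency. It rests on assumptions that the abstract does not state: the time series is taken to be already discretized into a string of symbols, the chain is first-order and fitted to that same string, and the p-value is approximated by a binomial upper tail that treats positions as independent (overlapping occurrences are not truly independent). The function names are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import comb


def word_probability(seq, word):
    """Probability of `word` starting at a fixed position under a
    first-order Markov chain fitted to `seq` (illustrative assumption)."""
    symbol_counts = Counter(seq)
    pair_counts = Counter(zip(seq, seq[1:]))
    # Each symbol counted as a transition origin (all positions but the last).
    origin_counts = Counter(seq[:-1])

    prob = symbol_counts[word[0]] / len(seq)
    for a, b in zip(word, word[1:]):
        if origin_counts[a] == 0:
            return 0.0
        prob *= pair_counts[(a, b)] / origin_counts[a]
    return prob


def count_occurrences(seq, word):
    """Number of (possibly overlapping) occurrences of `word` in `seq`."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def motif_p_value(seq, word):
    """Return (observed count, expected count, approximate p-value).

    The p-value is P(X >= observed) for X ~ Binomial(positions, p),
    an approximation assumed here for simplicity.
    """
    positions = len(seq) - len(word) + 1
    p = word_probability(seq, word)
    observed = count_occurrences(seq, word)
    expected = positions * p
    tail = sum(
        comb(positions, k) * p**k * (1 - p) ** (positions - k)
        for k in range(observed, positions + 1)
    )
    return observed, expected, min(1.0, tail)


if __name__ == "__main__":
    symbolic_series = "abababab"  # hypothetical discretized series
    print(motif_p_value(symbolic_series, "ab"))
```

A small p-value would indicate that the motif occurs more often than the fitted chain predicts. When many motifs are tested at once, the resulting p-values would normally also need a multiple-testing correction.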
