Finding Motifs in Time Series

The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns “motifs,” because of their close analogy to their discrete counterparts in computation biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition, it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this work we carefully motivate, then introduce, a non-trivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.

[1]  Paul S. Bradley,et al.  Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[2]  Stefano Lonardi,et al.  Monotony of surprise and large-scale quest for unusual words. , 2003 .

[3]  Eamonn J. Keogh,et al.  Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases , 2001, Knowledge and Information Systems.

[4]  Henrik André-Jönsson,et al.  Using Signature Files for Querying Time-Series Data , 1997, PKDD.

[5]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[6]  L. K. Hansen,et al.  On Clustering fMRI Time Series , 1999, NeuroImage.

[7]  Philip S. Yu,et al.  Mining long sequential patterns in a noisy environment , 2002, SIGMOD '02.

[8]  Ada Wai-Chee Fu,et al.  Efficient time series matching by wavelets , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[9]  Markus Hegland,et al.  Mining the MACHO dataset , 2001 .

[10]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[11]  Padhraic Smyth,et al.  Deformable Markov model templates for time-series pattern matching , 2000, KDD '00.

[12]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[13]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[14]  Frank Höppner Discovery of Temporal Patterns. Learning Rules about the Qualitative Behaviour of Time Series , 2001, PKDD.

[15]  Philip S. Yu,et al.  Meta-patterns: revealing hidden periodic patterns , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[16]  Haixun Wang,et al.  Landmarks: a new model for similarity-based pattern querying in time series databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[17]  Eamonn J. Keogh,et al.  Locally adaptive dimensionality reduction for indexing large time series databases , 2001, SIGMOD '01.

[18]  Gesine Reinert,et al.  Probabilistic and Statistical Properties of Words: An Overview , 2000, J. Comput. Biol..

[19]  Giuseppe Psaila,et al.  Querying Shapes of Histories , 1995, VLDB.

[20]  Rodger Staden,et al.  Methods for discovering novel motifs in nucleic acid sequences , 1989, Comput. Appl. Biosci..

[21]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[22]  Paul S. Bradley,et al.  Initialization of Iterative Refinement Clustering Algorithms , 1998, KDD.

[23]  Dimitrios Gunopulos,et al.  Discovering similar multidimensional trajectories , 2002, Proceedings 18th International Conference on Data Engineering.

[24]  Philip S. Yu,et al.  Adaptive query processing for time-series data , 1999, KDD '99.

[25]  Dennis Shasha,et al.  New techniques for best-match retrieval , 1990, TOIS.

[26]  Piotr Indyk,et al.  Identifying Representative Trends in Massive Time Series Data Sets Using Sketches , 2000, VLDB.

[27]  Philip S. Yu,et al.  MALM: a framework for mining sequence database at multiple abstraction levels , 1998, CIKM '98.

[28]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[29]  Patrick Brézillon,et al.  Lecture Notes in Artificial Intelligence , 1999 .

[30]  Christian Böhm,et al.  Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data , 2001, SIGMOD '01.

[31]  John F. Roddick,et al.  An Updated Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research , 2000, TSDM.

[32]  Heikki Mannila,et al.  Rule Discovery from Time Series , 1998, KDD.

[33]  Martti Juhola,et al.  Syntactic recognition of ECG signals by attributed finite automata , 1995, Pattern Recognit..

[34]  Walter A. Burkhard,et al.  Some approaches to best-match file searching , 1973, Commun. ACM.

[35]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[36]  Eamonn J. Keogh,et al.  An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback , 1998, KDD.

[37]  Jeremy Buhler,et al.  Finding Motifs Using Random Projections , 2002, J. Comput. Biol..

[38]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[39]  Stefano Lonardi,et al.  Global detectors of unusual words: design, implementation, and applications to pattern discovery in biosequences , 2001 .

[40]  Zbigniew R. Struzik,et al.  Measuring time series similarity through large singular features revealed with wavelet transformation , 1999, Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99.

[41]  Konstantinos Kalpakis,et al.  Distance measures for effective clustering of ARIMA time-series , 2001, Proceedings 2001 IEEE International Conference on Data Mining.