Sequential Pattern Discovery under a Markov Assumption

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. There are a number of fundamental aspects of this data mining problem that can make discovery “easy” or “hard”—we characterize the difficulty of learning in this context using an analysis based on the Bayes error rate under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.

[1]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[2]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[3]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[4]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[5]  David G. Stork,et al.  Pattern Classification , 1973 .

[6]  G. Church,et al.  A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. , 1998, Journal of molecular biology.

[7]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[8]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[9]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[10]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[11]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[12]  John T. Chu Error Bounds for a Contextual Recognition Procedure , 1971, IEEE Transactions on Computers.

[13]  C. K. Chow,et al.  A Recognition Method Using Neighbor Dependence , 1962, IRE Trans. Electron. Comput..

[14]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[15]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Mikhail S. Gelfand,et al.  Finding Weak Motifs in DNA Sequences , 2001, Pacific Symposium on Biocomputing.

[17]  S. Altschul,et al.  A tool for multiple sequence alignment. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[18]  G. McLachlan Discriminant Analysis and Statistical Pattern Recognition , 1992 .

[19]  T. Kunkel,et al.  Indirect readout of DNA sequence at the primary-kink site in the CAP-DNA complex: alteration of DNA binding specificity through alteration of DNA kinking. , 2001, Journal of molecular biology.

[20]  Yuh-Jyh Hu,et al.  Detecting Motifs from Sequences , 1999, ICML.

[21]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[22]  Josef Raviv,et al.  Decision making in Markov chains applied to the problem of pattern recognition , 1967, IEEE Trans. Inf. Theory.

[23]  David Martin,et al.  Computational Molecular Biology: An Algorithmic Approach , 2001 .

[24]  Mireille Régnier,et al.  On the approximate pattern occurrences in a text , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[25]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[26]  Richard H. Lathrop,et al.  DNA sequence and structure: direct and indirect recognition in protein-DNA binding , 2002, ISMB.