Analysis of Pattern Discovery in Sequences Using a Bayes Error Framework

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. A number of fundamental aspects of this data mining problem can make discovery "easy" or "hard". We characterize the difficulty of the problem using an analysis based on the Bayes error rate under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, since it provides a lower bound on the error rate that any algorithm can achieve. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
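
As a rough illustration of the quantity involved, the sketch below computes the Bayes error for deciding whether a single sequence position belongs to a pattern or to the background, given only the symbol observed at that position. This is a deliberately simplified, context-free version and not the paper's Markov analysis; the function name `bayes_error`, the four-letter alphabet, and the probability values are assumptions chosen for illustration.

```python
def bayes_error(prior_pattern, p_pattern, p_background):
    """Bayes error for a two-class decision (pattern vs. background)
    based on one observed symbol.

    The optimal rule picks the class with the larger joint probability,
    so the error mass for each symbol is the smaller of the two joints.
    """
    symbols = set(p_pattern) | set(p_background)
    error = 0.0
    for s in symbols:
        joint_pattern = prior_pattern * p_pattern.get(s, 0.0)
        joint_background = (1.0 - prior_pattern) * p_background.get(s, 0.0)
        error += min(joint_pattern, joint_background)
    return error


# Hypothetical distributions: uniform background, pattern position favours 'A'.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pattern = {"A": 0.9, "C": 0.1 / 3, "G": 0.1 / 3, "T": 0.1 / 3}

rare = bayes_error(0.01, pattern, background)    # pattern positions are rare
common = bayes_error(0.5, pattern, background)   # pattern positions are common
```

With these assumed numbers, a rare pattern position (prior 0.01) is never worth predicting from one symbol alone, so the error equals the prior. That is consistent with the abstract's point that pattern frequency affects how hard discovery is.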
