Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences

This article introduces the use of the finite Markov chain imbedding (FMCI) technique to study patterns in DNA under a hidden Markov model (HMM). As a step toward studying multiple runs-related statistics simultaneously under an HMM through FMCI, we investigate a bivariate runs statistic under a binary HMM for DNA pattern recognition. We derive and implement an FMCI-based recursive algorithm that determines the exact distribution of this bivariate runs statistic under three frameworks: independent and identically distributed (IID) trials, a Markov chain (MC), and a binary HMM. Using this algorithm, we study the distributions of the bivariate runs statistic under different binary HMM parameter sets; the resulting probabilistic profiles of runs are shown to be useful for trapping HMM maximum likelihood estimates (MLEs). This MLE-trapping scheme offers good initial estimates to jump-start the expectation-maximization (EM) algorithm in HMM parameter estimation and helps prevent the EM estimates from landing on a local maximum or a saddle point. Case studies on DNA bendability signals in human DNA data illustrate how the bivariate runs statistic and the probabilistic profiles, used in conjunction with binary HMMs, apply to pattern recognition in genomic DNA sequences.
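To make the idea of an imbedding-based recursion concrete, the following is a minimal sketch, not the authors' algorithm. The abstract does not name the two components of the bivariate runs statistic, so the sketch assumes the pair (number of runs of 1s, length of the longest run of 1s). The function name `runs_joint_pmf` and all parameter values are hypothetical. The recursion tracks an augmented state (hidden state, runs so far, longest run so far, current run length) and pushes probability mass forward one emission at a time.

```python
from collections import defaultdict


def runs_joint_pmf(n, init, trans, emit):
    """Exact joint pmf of (number of 1-runs, longest 1-run) over n emissions.

    init[h]     : probability that the first hidden state is h
    trans[h][g] : probability of moving from hidden state h to g
    emit[h][x]  : probability that hidden state h emits symbol x (0 or 1)
    """
    # Augmented (imbedded) state: hidden state, runs so far, longest run
    # so far, and length of the run of 1s currently in progress.
    dist = {(h, 0, 0, 0): p for h, p in enumerate(init) if p > 0}
    for _step in range(n):
        nxt = defaultdict(float)
        for (h, runs, longest, cur), p in dist.items():
            for x in (0, 1):
                px = p * emit[h][x]
                if px == 0:
                    continue
                if x == 1:
                    new_cur = cur + 1
                    # A new run starts only if no run was in progress.
                    new_runs = runs + (1 if cur == 0 else 0)
                    new_longest = max(longest, new_cur)
                else:
                    new_cur, new_runs, new_longest = 0, runs, longest
                # Move the hidden chain forward; after the last emission
                # this extra step sums to one and does not affect the pmf.
                for g, pg in enumerate(trans[h]):
                    if pg > 0:
                        nxt[(g, new_runs, new_longest, new_cur)] += px * pg
        dist = nxt
    # Marginalize out the hidden state and the trailing run length.
    pmf = defaultdict(float)
    for (_h, runs, longest, _cur), p in dist.items():
        pmf[(runs, longest)] += p
    return dict(pmf)


# Illustrative (made-up) binary HMM parameters.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
pmf = runs_joint_pmf(10, init, trans, emit)
print(sorted(pmf.items())[:5])
```

In this sketch the IID and MC frameworks appear as special cases of the same recursion: identical rows in `emit` give IID trials, and deterministic emissions (`emit = [[1, 0], [0, 1]]`) reduce the HMM to an observed Markov chain.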
