The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity

Peres and Shields recently introduced a new method for estimating the order of a stationary fixed order Markov chain [15]. They proved that the estimator is consistent by establishing a threshold result. Although this threshold is valid asymptotically, it is of limited use for DNA sequence analysis, where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in the data. We show that it compares favorably with other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to variable length Markov chains. Finally, we give an application to the problem of detecting DNA sequence similarity using genomic signatures.

Abbreviations: Mk = fixed order Markov model of order k, PST = prediction suffix tree, MC = Markov chain, VLMC = variable length Markov chain.

[1]  Sean R. Eddy, et al. Biological sequence analysis: Contents, 1998.

[2]  Dana Ron, et al. The power of amnesia: Learning probabilistic automata with variable memory length, 1996, Machine Learning.

[3]  Tsai-Hung Fan, et al. A Bayesian method in determining the order of a finite state Markov chain, 1999.

[4]  Terence P. Speed, et al. Finding Short DNA Motifs Using Permuted Markov Models, 2005, J. Comput. Biol.

[5]  M. Borodovsky, et al. Recognition of genes in DNA sequence with ambiguities, 1993, Bio Systems.

[6]  Sean R. Eddy, et al. Biological sequence analysis: Preface, 1998.

[7]  Peter Bühlmann, et al. Model Selection for Variable Length Markov Chains and Tuning the Context Algorithm, 2000.

[8]  R. Sandberg, et al. Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier, 2001, Genome research.

[9]  M. Blaser, et al. Evolutionary implications of microbial genome tetranucleotide frequency biases, 2003, Genome research.

[10]  Peter Bühlmann, et al. Variable Length Markov Chains: Methodology, Computing, and Software, 2004.

[11]  D. Forsdyke, et al. Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths", 1996, Journal of theoretical biology.

[12]  S. Karlin, et al. Dinucleotide relative abundance extremes: a genomic signature, 1995, Trends in genetics : TIG.

[13]  H. Akaike. A new look at the statistical model identification, 1974.

[14]  P. Bühlmann, et al. Variable Length Markov Chains: Methodology, Computing, and Software, 2004.

[15]  Tao Jiang, et al. Identifying transcription factor binding sites through Markov chain optimization, 2002, ECCB.

[16]  Yuval Peres, et al. Two new Markov order estimators, 2005.

[17]  G. Schwarz. Estimating the Dimension of a Model, 1978.

[18]  Devdatt P. Dubhashi, et al. Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures, 2006, Bioinform.

[19]  Golan Yona, et al. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, 2001, Bioinform.

[20]  I. Csiszar, et al. The consistency of the BIC Markov order estimator, 2000, 2000 IEEE International Symposium on Information Theory (Cat. No.00CH37060).