Bacterial genomes lacking long-range correlations may not be modeled by low-order Markov chains: The role of mixing statistics and frame shift of neighboring genes

We examine in detail the relationship between exponential correlation functions and Markov models in a bacterial genome. Although it is well known that Markov models generate sequences whose correlation functions decay exponentially, simply constructed Markov models that treat the DNA sequence as homogeneous, based on nearest-neighbor dimers (first-order), trimers (second-order), and up to hexamers (fifth-order), all fail to predict the value of the exponential decay rate. Even reading-frame-specific Markov models (both first- and fifth-order) cannot explain why the exponential decay is so slow. Starting with the in-phase coding DNA sequences (CDSs), we investigated correlations within a fixed-codon-position subsequence and in artificially constructed sequences made by packing CDSs with out-of-phase spacers, as well as by altering the CDS length distribution through an imposed upper limit. From these targeted analyses, we conclude that the correlation in the bacterial genomic sequence is mainly due to a mixing of heterogeneous statistics at different codon positions, and that the decay of the correlation is due to neighboring CDSs possibly being out of phase with each other. There are also small contributions to the correlation from bases at the same codon position and from non-coding sequences. These results show that the seemingly simple exponential correlation functions in a bacterial genome hide a complex correlation structure that is not suited to modeling by a Markov chain on a homogeneous sequence. Other results include the use of the second-largest eigenvalue (in absolute value) to represent the 16 correlation functions, and the prediction of a 10-11 base periodicity from the hexamer frequencies.
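The link between a first-order Markov chain and exponentially decaying correlations can be illustrated with a small numerical sketch. It is not the authors' code or data: the transition matrix `P`, the base frequencies `freq`, and the mixing parameter `s` are hypothetical toy values, chosen so that the second-largest eigenvalue of `P` is known to equal `s` exactly. The sketch computes the 16 correlation functions C_ab(k) = pi_a (P^k)_ab - pi_a pi_b and checks that all of them shrink by the same factor from one lag to the next.

```python
# Toy illustration (hypothetical values, not taken from the paper):
# correlation functions of a first-order Markov chain over A, C, G, T.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def stationary(p, iters=1000):
    """Approximate the stationary distribution by repeated left-multiplication."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi

def correlations(p, max_lag):
    """Return C(k) for k = 1..max_lag, where
    C_ab(k) = pi_a * (P^k)_ab - pi_a * pi_b."""
    n = len(p)
    pi = stationary(p)
    pk = [row[:] for row in p]  # holds P^k, starting at k = 1
    out = []
    for _ in range(max_lag):
        out.append([[pi[a] * pk[a][b] - pi[a] * pi[b] for b in range(n)]
                    for a in range(n)])
        pk = mat_mul(pk, p)
    return out

# Assumed toy chain: with probability s copy the previous base, otherwise
# draw a fresh base from freq. Its eigenvalues are 1 and s (three times).
s = 0.3
freq = [0.25, 0.25, 0.30, 0.20]
P = [[s * (a == b) + (1 - s) * freq[b] for b in range(4)] for a in range(4)]

corr = correlations(P, 6)

# Ratio C_ab(2) / C_ab(1) for each of the 16 base pairs (a, b).
ratios = [corr[1][a][b] / corr[0][a][b] for a in range(4) for b in range(4)]
```

In this toy chain every entry of `ratios` equals `s`, so a single number summarizes the decay of all 16 correlation functions. For a general transition matrix the correlations are sums of several eigenvalue terms, and the ratio only approaches the absolute value of the second-largest eigenvalue at larger lags.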
