Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics

MOTIVATION: Next-generation sequencing (NGS) technologies generate large amounts of short-read data for many different organisms. Because NGS reads are generally short, it is challenging to assemble them and reconstruct the original genome sequence. For clustering genomes from such NGS data, word-count based alignment-free sequence comparison is a promising approach, but it requires the underlying expected word counts. A plausible model for the underlying distribution of word counts is obtained by modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of the MC and its transition probability matrix. Because NGS data do not provide a single long sequence, these inference methods cannot be applied directly to NGS short-read data.

RESULTS: Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species.

AVAILABILITY AND IMPLEMENTATION: Our implementation of the statistics developed here is available as the R package 'NGS.MC' at http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

CONTACT: fsun@usc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
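To make the quantities named in the abstract concrete, the following Python sketch counts words within reads and picks a Markov chain order with a generic BIC criterion. It is an illustration under stated assumptions, not the estimators proposed in the paper and not the interface of the 'NGS.MC' package. All function names are hypothetical, the toy reads are simulated independently from a first-order chain, and the sketch treats reads as independent sequences. It therefore ignores the dependence between overlapping reads that the paper's approximations are designed to handle.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"


def count_words(reads, k):
    """Count k-words inside each read; a word never spans two reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def log_likelihood(reads, order, max_order):
    """Maximised log-likelihood of an order-`order` chain fitted to the reads.

    The first `max_order` bases of every read serve only as context, so all
    candidate orders are scored on the same set of positions.
    """
    transitions = Counter()
    for read in reads:
        for i in range(max_order, len(read)):
            transitions[(read[i - order:i], read[i])] += 1
    n_obs = sum(transitions.values())
    if n_obs == 0:
        raise ValueError("reads must be longer than max_order")
    context_totals = Counter()
    for (context, _), n in transitions.items():
        context_totals[context] += n
    loglik = sum(n * math.log(n / context_totals[context])
                 for (context, _), n in transitions.items())
    return loglik, n_obs


def estimate_order_bic(reads, max_order=3):
    """Return the order in 0..max_order with the smallest BIC (naive sketch)."""
    best_order, best_bic = 0, math.inf
    for order in range(max_order + 1):
        loglik, n_obs = log_likelihood(reads, order, max_order)
        # Free parameters: (alphabet size - 1) per context of length `order`.
        n_params = (len(ALPHABET) - 1) * len(ALPHABET) ** order
        bic = -2.0 * loglik + n_params * math.log(n_obs)
        if bic < best_bic:
            best_order, best_bic = order, bic
    return best_order


def simulate_order1_reads(n_reads, read_len, stay=0.7, seed=0):
    """Toy data (assumption): independent reads from a first-order chain."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        bases = [rng.choice(ALPHABET)]
        for _ in range(read_len - 1):
            prev = bases[-1]
            weights = [stay if b == prev else (1 - stay) / 3 for b in ALPHABET]
            bases.append(rng.choices(ALPHABET, weights=weights)[0])
        reads.append("".join(bases))
    return reads


if __name__ == "__main__":
    reads = simulate_order1_reads(n_reads=1000, read_len=50)
    print(count_words(reads, 2).most_common(3))
    print("estimated order:", estimate_order_bic(reads))
```

With real sequencing data at appreciable coverage, overlapping reads repeat the same genomic positions, so a criterion that assumes independent observations may favor too high an order. That is the setting in which read-aware statistics such as those described in the abstract become relevant.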
