A Bayesian framework for inferring the influence of sequence context on single base modifications

The probability of a single base modification (a mutation or a DNA/RNA modification) is expected to be strongly influenced by the flanking nucleotides that surround the site, known as the sequence context. This phenomenon may be attributed mainly to the enzyme that modifies or mutates the genetic material, since most enzymes tend to act on specific sequence contexts. Identifying context effects may therefore lead to the discovery of additional editing sites or unknown enzymatic factors. Here, we develop a statistical model that detects and evaluates the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, because the complexity of the model grows exponentially with context size. Our Bayesian method is based on sparse model selection, with the leading assumption that the number of sequence contexts that directly influence mutation rates is minuscule compared to the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even in the presence of noisy data. We next analyze empirical population sequencing data from polioviruses and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR1/2. In the current era, where next-generation sequencing data are highly abundant, our approach can be applied to any population sequencing data to reveal context-dependent base alterations, and may assist in the discovery of novel mutable sites or editing sites.
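The exponential growth in model complexity can be made concrete by counting contexts: over a four-letter nucleotide alphabet, a context spanning k positions has 4^k possible sequences. The following is a minimal sketch, assuming a context is simply a string of length k over A, C, G, U; the helper names `count_contexts` and `enumerate_contexts` are hypothetical and are not part of the authors' method.

```python
from itertools import product
from typing import List

# Assumption: RNA alphabet; substitute "ACGT" for DNA data.
NUCLEOTIDES = "ACGU"


def count_contexts(k: int) -> int:
    """Return the number of distinct sequence contexts of length k."""
    return len(NUCLEOTIDES) ** k


def enumerate_contexts(k: int) -> List[str]:
    """List every sequence context of length k (feasible only for small k)."""
    return ["".join(bases) for bases in product(NUCLEOTIDES, repeat=k)]


if __name__ == "__main__":
    # Print how quickly the number of candidate contexts grows with k.
    for k in (1, 3, 5, 7):
        print(f"context size {k}: {count_contexts(k)} possible contexts")
```

For pentanucleotide contexts (k = 5) this gives 1,024 candidate contexts, which suggests why an assumption that only a small fraction of them influence mutation rates helps keep model selection tractable.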
