Inference and visualization of DNA damage patterns using a grade of membership model

Motivation: Quality control plays a major role in the analysis of ancient DNA (aDNA). One key step in this quality control is the assessment of DNA damage: aDNA contains unique signatures of DNA damage that distinguish it from modern DNA, so analyses of damage patterns can help confirm that the DNA sequences obtained are from endogenous aDNA rather than from modern contamination. Predominant signatures of DNA damage include a high frequency of cytosine-to-thymine (C-to-T) substitutions at the ends of fragments and elevated rates of purines (A and G) before the 5′ strand-breaks. Existing QC procedures assess damage by simply plotting, for each sample, the C-to-T mismatch rate along the read and the composition of bases before the 5′ strand-breaks. Here we present a more flexible and comprehensive model-based approach to infer and visualize damage patterns in aDNA, implemented in the R package aRchaic. This approach is based on a ‘grade of membership’ model (also known as an ‘admixture’ or ‘topic’ model) in which each sample has an estimated grade of membership in each of K damage profiles that are estimated from the data.

Results: We illustrate aRchaic on data from several aDNA studies and on modern individuals from the 1000 Genomes Project (1000 Genomes Project Consortium, 2012). Here, aRchaic clearly distinguishes modern from ancient samples irrespective of DNA extraction, lab and sequencing protocols. Additionally, through an in-silico contamination experiment, we show that the aRchaic grades of membership reflect relative levels of exogenous modern contamination. Together, the outputs of aRchaic provide a concise visual summary of DNA damage patterns, as well as of other processes generating mismatches in the data.

Availability and implementation: aRchaic is available for download from https://www.github.com/kkdey/aRchaic.

Supplementary information: Supplementary data are available at Bioinformatics online.
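The grade of membership idea described above can be illustrated with a minimal sketch. This is not the aRchaic implementation (aRchaic is an R package and its feature set is richer); it is a generic EM fit of a multinomial grade of membership model in Python with NumPy. The function name fit_gom, the five mismatch features, the two simulated profiles and the sample mixture proportions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_gom(counts, K, n_iter=200, seed=0):
    """EM for a multinomial grade of membership (topic) model.

    counts : (N samples x M features) matrix of non-negative counts.
    Returns omega (N x K memberships), theta (K x M profiles) and the
    log-likelihood (up to an additive constant) at each iteration.
    """
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    omega = rng.dirichlet(np.ones(K), size=N)  # each row sums to 1
    theta = rng.dirichlet(np.ones(M), size=K)  # each row sums to 1
    loglik = []
    for _ in range(n_iter):
        p = omega @ theta  # N x M feature probabilities per sample
        loglik.append(float(np.sum(counts * np.log(p + 1e-300))))
        ratio = counts / (p + 1e-300)
        # M-step with the E-step folded in: expected counts assigned to
        # profile k, summed over features (omega) or over samples (theta).
        omega_new = omega * (ratio @ theta.T)
        theta_new = theta * (omega.T @ ratio)
        omega = omega_new / omega_new.sum(axis=1, keepdims=True)
        theta = theta_new / theta_new.sum(axis=1, keepdims=True)
    return omega, theta, loglik


# Hypothetical toy data: counts of five mismatch features per sample.
rng = np.random.default_rng(1)
features = ["C>T", "G>A", "T>C", "A>G", "other"]
ancient = np.array([0.55, 0.25, 0.05, 0.05, 0.10])  # assumed damage-like profile
modern = np.array([0.15, 0.15, 0.20, 0.20, 0.30])   # assumed modern-like profile
true_omega = np.array([[1, 0], [1, 0], [0.5, 0.5], [0, 1], [0, 1]], dtype=float)
probs = true_omega @ np.vstack([ancient, modern])
counts = np.vstack([rng.multinomial(5000, p) for p in probs])

omega, theta, loglik = fit_gom(counts, K=2)
print(np.round(omega, 2))  # estimated grades of membership, one row per sample
```

Each row of omega gives one sample's estimated membership in the K profiles, and each row of theta is an estimated profile over the features. Profile labels are arbitrary, and the fitted memberships need not match the simulated proportions exactly, because grade of membership models are not uniquely identifiable.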

[1] Kenny Q. Ye, et al. An integrated map of genetic variation from 1,092 human genomes, 2012, Nature.

[2] Philip L. F. Johnson, et al. Patterns of damage in genomic DNA sequences from a Neandertal, 2007, Proceedings of the National Academy of Sciences.

[3] Matthew Stephens, et al. A new sequence logo plot to highlight enrichment and depletion, 2017.

[4] E. Willerslev, et al. More on contamination: the use of asymmetric molecular behavior to identify authentic ancient human DNA, 2007, Molecular biology and evolution.

[5] Søren Brunak, et al. Population genomics of Bronze Age Eurasia, 2015, Nature.

[6] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[7] Qiaomei Fu, et al. A mitochondrial genome sequence of a hominin from Sima de los Huesos, 2013, Nature.

[8] Heng Li, et al. Genome sequence of a 45,000-year-old modern human from western Siberia, 2014, Nature.

[9] Swapan Mallick, et al. Partial uracil–DNA–glycosylase treatment for screening of ancient DNA, 2015, Philosophical Transactions of the Royal Society B: Biological Sciences.

[10] M. Stephens, et al. Visualizing the structure of RNA-seq expression data using grade of membership models, 2017, PLoS Genetics.

[11] Swapan Mallick, et al. Parallel paleogenomic transects reveal complex genetic history of early European farmers, 2017, Nature.

[12] Jeffrey H. Miller, et al. Mutagenic deamination of cytosine residues in DNA, 1980, Nature.

[13] Michael DeGiorgio, et al. A time transect of exomes from a Native American population before and after European contact, 2016, Nature Communications.

[14] P. Donnelly, et al. Inference of population structure using multilocus genotype data, 2000, Genetics.

[15] Philip L. F. Johnson, et al. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters, 2013, Bioinform.

[16] M. Jakobsson, et al. Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal, 2014, Proceedings of the National Academy of Sciences.

[17] Ivor Karavanić, et al. The Genomic History of Southeastern Europe, 2017.

[18] N. Goldman, et al. A codon-based model of nucleotide substitution for protein-coding DNA sequences, 1994, Molecular biology and evolution.

[19] M. Feldman, et al. Genetic Structure of Human Populations, 2002, Science.

[20] M. Stephens, et al. A Simple Model-Based Approach to Inferring and Visualizing Cancer Mutation Signatures, 2015, bioRxiv.

[21] Anders Albrechtsen, et al. ANGSD: Analysis of Next Generation Sequencing Data, 2014, BMC Bioinformatics.

[22] János Dani, et al. Genome flux and stasis in a five millennium transect of European prehistory, 2014, Nature Communications.

[23] Adrian W. Briggs, et al. A High-Coverage Genome Sequence from an Archaic Denisovan Individual, 2012, Science.

[24] Svante Pääbo, et al. Temporal Patterns of Nucleotide Misincorporations and DNA Fragmentation in Ancient DNA, 2012, PLoS ONE.

[25] P. Jones, et al. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA, 1994, Nucleic acids research.

[26] Philip L. F. Johnson, et al. The complete genome sequence of a Neandertal from the Altai Mountains, 2013, Nature.

[27] M. Hofreiter, et al. A Paleogenomic Perspective on Evolution and Gene Function: New Insights from Ancient DNA, 2014, Science.

[28] M. Slatkin, et al. Joint Estimation of Contamination, Error and Demography for Nuclear DNA from Ancient Humans, 2015, bioRxiv.

[29] Mattias Jakobsson, et al. Genomic Diversity and Admixture Differs for Stone-Age Scandinavian Foragers and Farmers, 2014, Science.

[30] Yong Wang, et al. An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia, 2011, Science.

[31] David H. Alexander, et al. Fast model-based estimation of ancestry in unrelated individuals, 2009, Genome research.

[32] Jonathan Scott Friedlaender, et al. A Human Genome Diversity Cell Line Panel, 2002, Science.

[33] D. Reich, et al. Genome-wide patterns of selection in 230 ancient Eurasians, 2015, Nature.

[34] Janet Kelso, et al. Schmutzi: estimation of contamination and endogenous mitochondrial consensus calling for ancient DNA, 2015, Genome Biology.

[35] Swapan Mallick, et al. The Beaker Phenomenon and the Genomic Transformation of Northwest Europe, 2017, bioRxiv.

[36] D. Reich, et al. The genetic history of Ice Age Europe, 2016, Nature.

[37] Swapan Mallick, et al. Genomic insights into the origin of farming in the ancient Near East, 2016, Nature.

[38] M. Thomas P. Gilbert, et al. mapDamage: testing for damage patterns in ancient DNA sequences, 2011, Bioinform.