Genome measures used for quality control are dependent on gene function and ancestry

MOTIVATION The transition/transversion (Ti/Tv) ratio and the heterozygous/nonreference-homozygous (het/nonref-hom) ratio are commonly computed in genetic studies as quality control (QC) measures. These two ratios also help in understanding patterns of DNA sequence evolution.

RESULTS To characterize these two genomic measures thoroughly, we analyzed genotype data released by the 1000 Genomes Project (1000G; N=1092). Two additional datasets (N=581 and N=6) were used to validate the findings from the 1000G dataset. We compared the two ratios across continental ancestries, genomic regions and gene functional categories. We found that the Ti/Tv ratio can be used as a quality indicator for single nucleotide polymorphisms inferred from high-throughput sequencing data. The Ti/Tv ratio varies greatly by genomic region and functionality, but not by ancestry. The het/nonref-hom ratio varies greatly by ancestry, but not by genomic region or functionality. Furthermore, extreme guanine + cytosine content (either high or low) is negatively associated with the magnitude of the Ti/Tv ratio. Thus, when performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genomic region. Failure to take these considerations into account at the QC stage will bias downstream analyses.

CONTACT yan.guo@vanderbilt.edu

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
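As an illustration of the two QC measures named in the abstract, the following minimal Python sketch computes a Ti/Tv ratio from a list of single nucleotide variants and a het/nonref-hom ratio from a list of diploid genotypes. It is not the authors' pipeline. The function names, the input layout (`(ref, alt)` base pairs and `(allele1, allele2)` integer pairs with `0` as the reference allele), and the choice to count any genotype with two different alleles as heterozygous are assumptions made for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return False
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions.

    Assumption: every entry is a biallelic SNV over A, C, G, T.
    Returns NaN when there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution; ignore
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def het_nonref_hom_ratio(genotypes):
    """Het/nonref-hom ratio for an iterable of (allele1, allele2) pairs.

    Assumption: alleles are integers with 0 = reference, as in a VCF GT field.
    Homozygous-reference calls (0, 0) are counted in neither group.
    Returns NaN when there are no nonreference-homozygous calls.
    """
    het = hom_alt = 0
    for a, b in genotypes:
        if a != b:
            het += 1
        elif a != 0:
            hom_alt += 1
    return het / hom_alt if hom_alt else float("nan")


if __name__ == "__main__":
    example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    example_gts = [(0, 1), (1, 1), (0, 0), (0, 1), (1, 1)]
    print(ti_tv_ratio(example_snvs))
    print(het_nonref_hom_ratio(example_gts))
```

Because the abstract reports that the Ti/Tv ratio depends on genomic region and the het/nonref-hom ratio depends on ancestry, a sketch like this would typically be applied separately to each region and to each ancestry group, not once to a pooled call set.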
