Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis.

Analysis of high-throughput sequencing data starts with alignment against a reference genome, which is the foundation for all re-sequencing data analyses. Each new release of the human reference genome has improved on its predecessor in accuracy and completeness. The latest release, GRCh38, is presumed to benefit high-throughput sequencing data analysis by providing a more accurate reference, but the size of that improvement has not yet been quantified. We conducted a study comparing genomic analysis results obtained with GRCh38 against those obtained with its predecessor, GRCh37. Through analyses of alignment, single nucleotide polymorphisms, small insertions/deletions, and copy number and structural variants, we show that GRCh38 offers more accurate analysis of human sequencing data overall. More importantly, GRCh38 produced fewer false positive structural variants. In conclusion, GRCh38 not only improves on GRCh37 as a genome assembly but also yields more reliable genomic analysis results.
