Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
