De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data

The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and may represent specific population groups suboptimally. Here, we performed a de novo assembly of two Swedish genomes, which revealed over 10 Mb of sequence absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS in the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields >75,000 putative novel single nucleotide variants (SNVs) and removes >10,000 false-positive SNV calls per individual, some of which are located in protein-coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.
