Deep sequencing of 10,000 human genomes

Significance

Large-scale initiatives toward personalized medicine are driving a massive expansion in the number of human genomes being sequenced. There is therefore an urgent need to define quality standards for clinical use, including deep coverage and high sequencing accuracy for an individual's genome. Our work represents the largest effort to date in sequencing human genomes at deep coverage with these new standards. This study identifies over 150 million human variants, a majority of them rare and unknown. Moreover, these data identify sites in the genome that are highly intolerant to variation and are possibly essential for life or health. We conclude that high-coverage genome sequencing provides accurate detail on human variation for discovery and clinical applications.

Abstract

We report on the sequencing of 10,545 human genomes at 30×–40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing are of the quality necessary for clinical use.
