Comprehensive Characterization of Human Genome Variation by High Coverage Whole-Genome Sequencing of Forty Four Caucasians

Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, and the majority of which (∼96%) have a minor allele frequency less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely “knock out” the corresponding genes. Across all the 44 genomes, a total of 182 genes were “knocked-out” in at least one individual genome, among which 46 genes were “knocked out” in over 30% of our samples, suggesting that a number of genes are commonly “knocked-out” in general populations. Gene ontology analysis suggested that these commonly “knocked-out” genes are enriched in biological process related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.

[1]  Dmitry Pushkarev,et al.  Single-molecule sequencing of an individual human genome , 2009, Nature Biotechnology.

[2]  Ken Chen,et al.  Recurring mutations found by sequencing an acute myeloid leukemia genome. , 2009, The New England journal of medicine.

[3]  Elizabeth T. Cirulli,et al.  The Characterization of Twenty Sequenced Human Genomes , 2010, PLoS genetics.

[4]  Taishin Kin,et al.  Idiographica: a general-purpose web application to build idiograms on-demand for human, mouse and rat , 2007, Bioinform..

[5]  Robert B. Hartlage,et al.  This PDF file includes: Materials and Methods , 2009 .

[6]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[7]  Jacob A. Tennessen,et al.  Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes , 2012, Science.

[8]  P. Shannon,et al.  Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing , 2010, Science.

[9]  Masao Nagasaki,et al.  Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing , 2010, Nature Genetics.

[10]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[11]  Timothy B. Stockwell,et al.  The Diploid Genome Sequence of an Individual Human , 2007, PLoS biology.

[12]  Israel Steinfeld,et al.  BMC Bioinformatics BioMed Central , 2008 .

[13]  Chao Xie,et al.  CNV-seq, a new method to detect copy number variation using high-throughput sequencing , 2009, BMC Bioinformatics.

[14]  P. Stankiewicz,et al.  Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. , 2010, The New England journal of medicine.

[15]  Anil K. Malhotra,et al.  Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing , 2010, Nucleic acids research.

[16]  N. Friel,et al.  Sequencing and analysis of an Irish human genome , 2010 .

[17]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[18]  Amy E. Hawkins,et al.  DNA sequencing of a cytogenetically normal acute myeloid leukemia genome , 2008, Nature.

[19]  Elizabeth T. Cirulli,et al.  Whole-Genome Sequencing of a Single Proband Together with Linkage Analysis Identifies a Mendelian Disease Gene , 2010, PLoS genetics.

[20]  Claudio J. Verzilli,et al.  An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People , 2012, Science.

[21]  H. P. Kang,et al.  Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals , 2011, Nature Genetics.

[22]  Kenny Q. Ye,et al.  Mapping copy number variation by population scale genome sequencing , 2010, Nature.

[23]  Joseph K. Pickrell,et al.  A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes , 2012, Science.

[24]  D. Turnbull,et al.  Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA , 1999, Nature Genetics.

[25]  Francisco M. De La Vega,et al.  Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. , 2009, Genome research.

[26]  Sangsoo Kim,et al.  The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. , 2009, Genome research.

[27]  C. Carlson,et al.  Principles for the post-GWAS functional characterization of cancer risk loci , 2011, Nature Genetics.

[28]  Thomas D. Wu,et al.  A highly annotated whole-genome sequence of a Korean individual , 2009, Nature.

[29]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[30]  F. Collins,et al.  Genomic medicine--an updated primer. , 2010, The New England journal of medicine.

[31]  C. Kingsley Identification of causal sequence variants of disease in the next generation sequencing era. , 2011, Methods in molecular biology.

[32]  D. MacArthur,et al.  Loss-of-function variants in the genomes of healthy humans. , 2010, Human molecular genetics.

[33]  Mark Gerstein,et al.  VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment , 2012, Bioinform..

[34]  L. Feuk,et al.  Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome , 2006, Cytogenetic and Genome Research.

[35]  F. Collins,et al.  Potential etiologic and functional implications of genome-wide association loci for human diseases and traits , 2009, Proceedings of the National Academy of Sciences.

[36]  Bronwen L. Aken,et al.  GENCODE: The reference human genome annotation for The ENCODE Project , 2012, Genome research.

[37]  Judy H. Cho,et al.  Finding the missing heritability of complex diseases , 2009, Nature.

[38]  A. Sparks,et al.  The mutation spectrum revealed by paired genome sequences from a lung cancer patient , 2010, Nature.

[39]  Jason H. Moore,et al.  Missing heritability and strategies for finding the underlying causes of complex disease , 2010, Nature Reviews Genetics.

[40]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[41]  Francis S Collins,et al.  A HapMap harvest of insights into the genetics of common disease. , 2008, The Journal of clinical investigation.