A global reference for human genetic variation

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
