Genome-wide detection and characterization of positive selection in human populations

With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 nonsynonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.
An increasing amount of information about genetic variation, together with new analytical methods, is making it possible to explore the recent evolutionary history of the human population. The first phase of the International Haplotype Map, including ~1 million single nucleotide polymorphisms (SNPs), allowed preliminary examination of natural selection in humans. Now, with the publication of the Phase 2 map (HapMap2) in a companion paper, over 3 million SNPs have been genotyped in 420 chromosomes from three continents (120 European (CEU), 120 African (YRI) and 180 Asian from Japan and China (JPT + CHB)).
In our analysis of HapMap2, we first implemented two widely used tests that detect recent positive selection by finding common alleles carried on unusually long haplotypes. The two tests, the Long-Range Haplotype (LRH) test and the integrated Haplotype Score (iHS) test, rely on the principle that, under positive selection, an allele may rise to high frequency rapidly enough that long-range association with nearby polymorphisms (the long-range haplotype) will not have time to be eliminated by recombination. These tests control for local variation in recombination rates by comparing long haplotypes to other alleles at the same locus. As a result, they lose power as selected alleles approach fixation (100% frequency), because there are then few alternative alleles in the population (Supplementary Fig. 2 and Supplementary Tables 1–2). We next developed, evaluated and applied a new test, Cross Population Extended Haplotype Homozygosity (XP-EHH), to detect selective sweeps in which the selected allele has approached or achieved fixation in one population but remains polymorphic in the human population as a whole (Methods, and Supplementary Fig. 2 and Supplementary Tables 3–6). Related methods have recently also been described.
Our analysis of recent positive selection, using the three methods, reveals more than 300 candidate regions (Supplementary Fig. 3 and Supplementary Table 7), 22 of which are above a threshold such that no similar events were found in 10 Gb of simulated neutrally evolving sequence (Methods). We focused on these 22 strongest signals (Table 1), which include two well-established cases, SLC24A5 and LCT, and 20 other regions with signals of similar strength. The challenge is to sift through genetic variation in the candidate regions to identify the variants that were the targets of selection. Our candidate regions are large (mean length, 815 kb; maximum length, 3.5 Mb) and often contain multiple genes (median, 4; maximum, 15).
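The idea behind the long-haplotype and cross-population tests described above can be illustrated with a minimal sketch. It assumes phased haplotypes encoded as strings of 0/1 alleles and uses the standard pairwise definition of extended haplotype homozygosity (EHH). The function names (`ehh`, `integrated_ehh`, `unstandardized_xpehh`) and the toy data are hypothetical. The integration bounds and genome-wide normalization specified in Methods are not reproduced, so this is not the implementation used in the analysis.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, end):
    """Fraction of haplotype pairs that are identical at every site
    between the core site and `end` (both indices inclusive)."""
    lo, hi = sorted((core, end))
    n = len(haplotypes)
    if n < 2:
        return 0.0
    # Group chromosomes by their allele string over the interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def integrated_ehh(haplotypes, positions, core):
    """Trapezoid-rule area under the EHH curve over all sites,
    extending in both directions from the core."""
    values = [ehh(haplotypes, core, j) for j in range(len(positions))]
    area = 0.0
    for j in range(len(positions) - 1):
        width = positions[j + 1] - positions[j]
        area += 0.5 * (values[j] + values[j + 1]) * width
    return area


def unstandardized_xpehh(pop_a, pop_b, positions, core):
    """Log ratio of integrated EHH in population A versus population B.
    Positive values indicate longer shared haplotypes in A.
    Assumes the integrated EHH of population B is non-zero."""
    return log(integrated_ehh(pop_a, positions, core)
               / integrated_ehh(pop_b, positions, core))


# Toy data (illustrative only): population A carries a single haplotype,
# as after a completed sweep; population B remains diverse.
positions = [0, 10, 20, 30, 40]
pop_a = ["01010"] * 4
pop_b = ["01010", "01011", "11010", "00110"]
score = unstandardized_xpehh(pop_a, pop_b, positions, core=2)
```

Because population A is fixed for one haplotype, a within-population comparison of alternative alleles has nothing to compare against, whereas the cross-population ratio remains informative. This is the case the text describes as motivating XP-EHH.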
A typical region harbours ~400–4,000 common SNPs (minor allele frequency >5%), of which roughly three-quarters are represented in current SNP databases and half were genotyped as part of HapMap2 (Supplementary Table 8). We developed three criteria to help highlight potential targets of selection (Supplementary Fig. 1): (1) selected alleles detectable by our tests are likely to be derived (newly arisen), because long-haplotype tests have little power to detect selection on standing (pre-existing) variation; we therefore focused on derived alleles, as identified by comparison to primate outgroups; (2) selected alleles are likely to be highly differentiated between populations, because recent selection is probably a local environmental adaptation; we thus looked for alleles common in only the population(s) under selection; (3) selected alleles must have biological effects. On the basis of current knowledge, we therefore focused on nonsynonymous coding SNPs and SNPs in evolutionarily conserved sequences. These criteria are intended as heuristics, not absolute requirements. Some targets of selection may not satisfy them, and some will not be in current SNP databases. Nonetheless, with ~50% of common SNPs in these populations genotyped in HapMap2, a search for causal variants is timely. We applied the criteria to the regions containing SLC24A5 and LCT, each of which already has a strong candidate gene, mutation and trait. At SLC24A5, the 600 kb region contains 914 genotyped
