Ancestry estimation and control of population stratification for sequence-based association studies

Estimating individual ancestry is important in genetic association studies, where population structure can lead to false-positive signals, but assigning ancestry remains challenging with targeted sequence data. We propose a new method for accurately estimating individual genetic ancestry based on direct analysis of off-target sequence reads, and we implement it in the publicly available LASER software. We validate the method using simulated and empirical data and show that it can accurately infer worldwide continental ancestry from sequencing data sets with whole-genome shotgun coverage as low as 0.001×. For estimates of fine-scale ancestry within Europe, the method performs well with coverage of 0.1×. On an even finer scale, it improves discrimination between exome-sequenced study participants originating from different provinces within Finland. Finally, we show that our method can be used to improve case-control matching in genetic association studies and to reduce the risk of spurious findings due to population structure.
