A tutorial on statistical methods for population association studies

Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy–Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.
