Statistical analysis strategies for association studies involving rare variants

The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technologies has made studies of rare variants feasible, but sequencing alone will not be sufficient for their success: appropriate analytical methods are also needed. We consider data analysis approaches for testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Although a wide variety of analytical approaches exist, more work is needed to refine them and to determine their properties and power in different contexts.
