Real world scenarios in rare variant association analysis: the impact of imbalance and sample size on the power in silico

Background: The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits. However, little is known about how sample size, the number of cases, and the balance of cases versus controls affect burden-based and dispersion-based rare variant association methods. For example, Phenome-Wide Association Studies may have a wide range of case and control sample sizes across hundreds of diagnoses and traits. When statistical methods for rare variants are applied in that setting, it is important to understand the strengths and limitations of the analyses.

Results: We conducted a large-scale simulation of randomly selected low-frequency protein-coding regions using twelve balanced sample scenarios, each with an equal number of cases and controls, and twenty-one unbalanced sample scenarios. We further explored statistical performance across different minor allele frequency thresholds and a range of genetic effect sizes. Our simulation results demonstrate that an unbalanced study design has an overall higher type I error rate than a balanced study design for both burden and dispersion tests. Regression has an overall higher type I error with balanced cases and controls, while SKAT has higher type I error for unbalanced case-control scenarios. We also found that, in scenarios with a large control group, both type I error and power were driven by the number of cases in addition to the case-to-control ratio. Based on our power simulations, a SKAT analysis with more than 200 cases in unbalanced case-control models yielded over 90% power with relatively well-controlled type I error. To achieve similar power in regression, more than 500 cases are needed. Moreover, SKAT showed higher power than regression to detect associations in unbalanced case-control scenarios.

Conclusions: Our results provide important insights into rare variant association study designs by providing a landscape of type I error and statistical power for a wide range of sample sizes. These results can serve as a benchmark for making decisions about study design for rare variant analyses.
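To make the kind of experiment described above concrete, the following is a minimal sketch of how an empirical rejection rate can be estimated for a given number of cases and controls. It is not the simulation pipeline, the regression model, or the SKAT test used in the study. As a stand-in it uses a simple carrier-status comparison (a CAST-style collapsing test evaluated with a two-proportion z-test), and the function names, the carrier frequency of 0.02, the odds ratio, and the number of replicates are all illustrative assumptions. With an odds ratio of 1 the rejection rate estimates type I error; with an odds ratio above 1 it estimates power.

```python
import math

import numpy as np


def carrier_z_test(case_carriers, n_cases, control_carriers, n_controls):
    """Two-sided two-proportion z-test on rare-variant carrier status.

    Illustrative stand-in for a burden test; NOT the regression or SKAT
    tests evaluated in the study.
    """
    pooled = (case_carriers + control_carriers) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    if se == 0:
        return 1.0  # no carriers (or all carriers): no evidence either way
    z = (case_carriers / n_cases - control_carriers / n_controls) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value


def rejection_rate(n_cases, n_controls, carrier_freq=0.02, odds_ratio=1.0,
                   alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated studies with p < alpha.

    All defaults are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    # Carrier probability in cases implied by the assumed odds ratio.
    odds = odds_ratio * carrier_freq / (1 - carrier_freq)
    p_case = odds / (1 + odds)
    case_counts = rng.binomial(n_cases, p_case, size=n_sims)
    control_counts = rng.binomial(n_controls, carrier_freq, size=n_sims)
    rejections = sum(
        carrier_z_test(a, n_cases, b, n_controls) < alpha
        for a, b in zip(case_counts, control_counts)
    )
    return rejections / n_sims


if __name__ == "__main__":
    # Balanced versus unbalanced designs, under the null and an assumed effect.
    for n_cases, n_controls in [(1000, 1000), (200, 10000)]:
        t1e = rejection_rate(n_cases, n_controls, odds_ratio=1.0)
        power = rejection_rate(n_cases, n_controls, odds_ratio=3.0)
        print(f"{n_cases} cases / {n_controls} controls: "
              f"type I error ~ {t1e:.3f}, power ~ {power:.3f}")
```

Sweeping `n_cases` and `n_controls` over a grid of balanced and unbalanced designs gives a landscape of type I error and power in the spirit of the one summarized above, although the numbers from this simplified test should not be expected to match the reported results.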
