Detection boundary and Higher Criticism approach for rare and weak genetic effects

Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors explain only a small fraction of the traits' genetic heritability, and it is argued that many more genetic factors remain undiscovered. These undiscovered factors are likely to be weakly associated with the trait at the population level and sparsely distributed across the genome. In this paper, we adapt recent innovations on Tukey's Higher Criticism (HC) (Tukey [The Higher Criticism (1976) Princeton Univ.]; Donoho and Jin [Ann. Statist. 32 (2004) 962-994]) to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. At the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space whose axes quantify the rarity and the strength of the genetic effects. Above the detection boundary, the overall effects of the genetic factors are strong enough for reliable detection. Below it, the genetic factors are too rare and too weak for any method to detect them reliably. We show that HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary, and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn's disease.
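As a rough illustration of the two ideas named above, the following Python sketch computes the Higher Criticism statistic from a set of p-values and evaluates a detection boundary. It is not the SNP-set procedure developed in this paper: the statistic is the standard form from Donoho and Jin (2004), and the boundary is the one they give for the Gaussian sparse-mixture model (a fraction n^(-beta) of signals with mean sqrt(2 r log n)), which is assumed here only for illustration. The function names `higher_criticism` and `dj_detection_boundary` and the default `alpha0 = 0.5` are hypothetical choices.

```python
import math


def higher_criticism(pvalues, alpha0=0.5):
    """HC statistic in the Donoho-Jin (2004) form (illustrative sketch).

    Takes the maximum, over the smallest alpha0 fraction of the sorted
    p-values, of sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))).
    """
    p = sorted(pvalues)
    n = len(p)
    if n == 0:
        raise ValueError("need at least one p-value")
    k = max(1, int(alpha0 * n))  # number of order statistics scanned
    best = -math.inf
    for i in range(1, k + 1):
        pi = p[i - 1]
        if pi <= 0.0 or pi >= 1.0:
            continue  # skip degenerate values that would divide by zero
        z = math.sqrt(n) * (i / n - pi) / math.sqrt(pi * (1.0 - pi))
        best = max(best, z)
    return best


def dj_detection_boundary(beta):
    """Detection boundary r = rho*(beta) for the Gaussian sparse mixture.

    Assumed model (Donoho and Jin 2004), not this paper's genetic model:
    signals are detectable when the strength r exceeds rho*(beta).
    """
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


if __name__ == "__main__":
    n = 100
    null_p = [i / (n + 1) for i in range(1, n + 1)]  # evenly spread p-values
    signal_p = [1e-8] * 5 + null_p[5:]               # five very small p-values
    print(higher_criticism(null_p), higher_criticism(signal_p))
    print(dj_detection_boundary(0.6), dj_detection_boundary(0.9))
```

A large HC value suggests that a few of the p-values are smaller than a global null would produce, even when none of them is individually significant after multiple-testing correction. In practice the statistic would be calibrated against its null distribution, for example by permutation, which this sketch does not do.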
