A regression framework for the proportion of true null hypotheses

Multiple hypothesis testing concerns arise in modern scientific studies across many areas of research. The false discovery rate is one of the most commonly used error rates for measuring and controlling the rate of false discoveries when performing multiple tests. Adaptive false discovery rate procedures rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. We provide both finite-sample and asymptotic conditions under which this covariate-adjusted estimate is conservative, leading to appropriately conservative false discovery rate estimates. Our case study is a genome-wide association meta-analysis of body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach in a number of simulation scenarios.
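The abstract does not spell out the estimator, so the following is only a minimal sketch of one way a covariate-conditional null proportion could be estimated, not necessarily the authors' procedure. It assumes null p-values are uniform, so that P(p > lambda | null) = 1 - lambda; regressing the indicator 1(p > lambda) on the covariates and dividing the fitted values by 1 - lambda then gives an estimate that tends to err on the high (conservative) side. The function names, the threshold `lam=0.8`, the linear regression model, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def estimate_pi0_regression(pvalues, covariates, lam=0.8):
    """Hypothetical sketch: estimate the null proportion given covariates.

    Regresses the indicator 1(p > lam) on the covariates by ordinary least
    squares and rescales the fitted values by 1 / (1 - lam).
    """
    p = np.asarray(pvalues, dtype=float)
    X = np.asarray(covariates, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Indicator that each p-value exceeds the threshold.
    y = (p > lam).astype(float)
    # Design matrix with an intercept column.
    Z = np.column_stack([np.ones(len(p)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    # Rescale and clip so the estimates are valid proportions.
    return np.clip(fitted / (1.0 - lam), 0.0, 1.0)


def demo(m=20000, seed=0):
    """Simulated data (an assumption): the null proportion falls as x grows."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=m)
    true_pi0 = 0.9 - 0.4 * x
    is_null = rng.uniform(size=m) < true_pi0
    # Null p-values are uniform; alternative p-values concentrate near zero.
    p = np.where(is_null, rng.uniform(size=m), rng.beta(0.2, 5.0, size=m))
    return x, estimate_pi0_regression(p, x)


if __name__ == "__main__":
    x, pi0_hat = demo()
    print("mean estimate for x < 0.2:", pi0_hat[x < 0.2].mean())
    print("mean estimate for x > 0.8:", pi0_hat[x > 0.8].mean())
```

In the case study described above, the covariate columns would hold quantities such as per-locus sample size and minor allele frequency; a linear fit is used here only for brevity, and a more flexible regression could be substituted.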
