A modified generalized Fisher method for combining probabilities from dependent tests

Rapid developments in molecular technology have yielded large amounts of high-throughput genetic data for studying the mechanisms underlying complex traits. The growing number of genetic variants requires hundreds to thousands of statistical tests to be performed simultaneously, which makes it challenging to control the overall Type I error rate. Combining p-values from multiple hypothesis tests has shown promise for aggregating effects in high-dimensional genetic data analysis. Several p-value combining methods have been developed and applied to genetic data; see Dai et al. (2012b) for a comprehensive review. However, few studies have addressed dependent genetic data, especially for weighted p-value combining methods. Single nucleotide polymorphisms (SNPs) are often correlated due to linkage disequilibrium (LD). Other genetic data, including variants from next-generation sequencing, gene expression levels measured by microarray, and protein and DNA methylation data, also contain complex correlation structures. Ignoring the correlation among genetic variants may lead to severe inflation of the Type I error rate when p-values are combined in an omnibus test. In this work, we propose modifications to the Lancaster procedure, a weighted generalization of Fisher's method, that take the correlation structure among p-values into account. The weight function in the Lancaster procedure allows meaningful biological information to be incorporated into the statistical analysis, which can increase the power of the test and/or remove bias. Extensive empirical assessments demonstrate that the modified Lancaster procedure substantially reduces the Type I error inflation caused by correlation among p-values while retaining considerable power to detect signals among them. We applied our method to reassess published renal transplant data and identified a novel association between B cell pathways and allograft tolerance.
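To make the idea concrete, the following is a minimal sketch of Lancaster's weighted combination together with one common way to adjust it for dependence, a Satterthwaite-type moment-matching approximation. It is an illustration under stated assumptions, not necessarily the exact procedure proposed in the paper. The function names and the `cov_terms` argument (a covariance matrix of the transformed terms, which in practice would have to be estimated from the data) are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def lancaster_statistic(pvalues, weights):
    """Sum of chi-square quantiles, one per p-value, with df equal to its weight."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    # chi2.isf(p, df) returns x such that P(X > x) = p for X ~ chi-square(df).
    # With every weight equal to 2 this reduces to Fisher's -2 * sum(log p).
    return chi2.isf(p, df=w).sum()


def lancaster_independent(pvalues, weights):
    """Combined p-value assuming independent p-values: T ~ chi-square(sum of weights)."""
    t = lancaster_statistic(pvalues, weights)
    return chi2.sf(t, df=np.sum(weights))


def lancaster_satterthwaite(pvalues, weights, cov_terms):
    """Combined p-value under dependence via moment matching (illustrative assumption).

    cov_terms is an assumed, user-supplied covariance matrix of the transformed
    terms; under the null its diagonal would be 2 * weights.
    T is approximated by c * chi-square(v), with c and v chosen to match
    the mean and variance of T.
    """
    t = lancaster_statistic(pvalues, weights)
    mean = np.sum(weights)          # E[T] under the null
    var = np.sum(cov_terms)         # Var[T] = sum of all covariance entries
    c = var / (2.0 * mean)          # scale factor
    v = 2.0 * mean**2 / var         # effective degrees of freedom
    return chi2.sf(t / c, df=v)


if __name__ == "__main__":
    p = [0.01, 0.02, 0.03]
    w = [2.0, 2.0, 2.0]
    # Hypothetical covariance: null variances 2*w on the diagonal, 2.0 off-diagonal.
    cov = np.full((3, 3), 2.0) + np.diag([2.0, 2.0, 2.0])
    print("independent:", lancaster_independent(p, w))
    print("dependence-adjusted:", lancaster_satterthwaite(p, w, cov))
```

When the off-diagonal covariances are zero, the scale factor is 1 and the adjusted test coincides with the independent version. Positive covariances inflate the variance of the statistic, which lowers the effective degrees of freedom and makes the combined p-value more conservative.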

[1]  L. Wasserman,et al.  False discovery control with p-value weighting , 2006 .

[2]  Richard Charnigo,et al.  Omnibus testing and gene filtration in microarray data analysis , 2008 .

[3]  Momiao Xiong,et al.  Gene and pathway-based second-wave analysis of genome-wide association studies , 2010, European Journal of Human Genetics.

[4]  R. Charnigo,et al.  Integrating P-values for Genetic and Genomic Data Analysis , 2012 .

[5]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[6]  E. B. Wilson,et al.  The Distribution of Chi-Square. , 1931, Proceedings of the National Academy of Sciences of the United States of America.

[7]  E. Suchman,et al.  The American Soldier: Adjustment During Army Life. , 1949 .

[8]  Y. Nikolsky,et al.  Protein networks and pathway analysis. Preface. , 2009, Methods in molecular biology.

[9]  Ramon C. Littell,et al.  Asymptotic Optimality of Fisher's Method of Combining Independent Tests , 1971 .

[10]  Atul J. Butte,et al.  Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges , 2012, PLoS Comput. Biol..

[11]  S. Rice,et al.  Saddle point approximation for the distribution of the sum of independent random variables , 1980, Advances in Applied Probability.

[12]  R. Fisher On the Interpretation of χ2 from Contingency Tables, and the Calculation of P , 2022 .

[13]  Michael C Wu,et al.  Prior biological knowledge-based approaches for the analysis of genome-wide expression profiles using gene sets and pathways , 2009, Statistical methods in medical research.

[14]  J. Steven Leeder,et al.  Global tests of P-values for multifactor dimensionality reduction models in selection of optimal number of target genes , 2012, BioData Mining.

[15]  H. O. Lancaster THE COMBINATION OF PROBABILITIES: AN APPLICATION OF ORTHONORMAL FUNCTIONS , 1961 .

[16]  M. Kendall Statistical Methods for Research Workers , 1937, Nature.

[17]  Lincoln Stein,et al.  Reactome knowledgebase of human biological pathways and processes , 2008, Nucleic Acids Res..

[18]  R. Fisher,et al.  Statistical Methods for Research Workers , 1930, Nature.

[19]  Minping Qian,et al.  Gene-Centric Genomewide Association Study via Entropy , 2008, Genetics.

[20]  P. Patnaik THE NON-CENTRAL χ2- AND F-DISTRIBUTIONS AND THEIR APPLICATIONS , 1949 .

[21]  Julie Bryant,et al.  Protein Networks and Pathway Analysis , 2009, Methods in Molecular Biology.

[22]  P. Patnaik The Non-central χ2- and F-distributions and Their Applications , 1949 .

[23]  Stan Pounds,et al.  False discovery rate paradigms for statistical analyses of microarray gene expression data , 2007, Bioinformation.

[24]  Kai Wang,et al.  Pathway-based approaches for analysis of genomewide association studies. , 2007, American journal of human genetics.

[25]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[26]  Susumu Goto,et al.  The KEGG resource for deciphering the genome , 2004, Nucleic Acids Res..

[27]  R. R. Bahadur Rates of Convergence of Estimates and Test Statistics , 1967 .

[28]  M. Suthanthiran,et al.  Identification of a B cell signature associated with renal transplant tolerance in humans. , 2010, The Journal of clinical investigation.

[29]  James A. Koziol A Note on Lancaster's Procedure for the Combination of Independent Events , 1996 .

[30]  R. Fisher On the Interpretation of χ2 from Contingency Tables, and the Calculation of P , 2018, Journal of the Royal Statistical Society Series A (Statistics in Society).

[31]  Nathaniel Rothman,et al.  Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. , 2004, Journal of the National Cancer Institute.

[32]  E. Cornish,et al.  The Percentile Points of Distributions Having Known Cumulants , 1960 .

[33]  Yuehua Cui,et al.  A combined p-value approach to infer pathway regulations in eQTL mapping , 2011 .