rPCMP: robust p-value combination by multiple partitions with applications to ATAC-seq data

Background: Evaluating the significance of a group of genes or proteins in a pathway or biological process for a disease can help researchers understand the mechanism of the disease. For example, identifying pathways or gene functions related to the chromatin states of tumor-specific T cells helps determine whether the T cells can be reprogrammed, and in turn helps design cancer treatment strategies. Some existing p-value combination methods can be used in this scenario. However, each of these methods has drawbacks, so designing a more powerful and robust statistical method remains challenging.

Results: The existing group-combined p-value (GCP) method first partitions the p-values into several groups using a set of truncation points, but it is often sensitive to the choice of these points. Another method, the adaptive rank truncated product (ARTP) method, uses multiple truncation integers to adaptively combine the smallest p-values, but it loses statistical power because it ignores the larger p-values. To tackle these problems, we propose a robust p-value combination method (rPCMP) that considers multiple partitions of the p-values, each defined by a different set of truncation points. The proposed rPCMP statistic has a three-layer hierarchical structure. The inner layer is a statistic that combines the p-values falling in an interval defined by two threshold points; the intermediate layer is a GCP statistic that optimizes the inner-layer statistic over a partition defined by a set of threshold points; and the outer layer integrates the GCP statistics from multiple partitions of the p-values. The distribution of the statistic under the null hypothesis can be estimated empirically by a permutation procedure.

Conclusions: Simulation studies show that rPCMP effectively controls the type I error rate and achieves higher statistical power than the existing methods, while being more robust. Finally, we apply rPCMP to an ATAC-seq dataset to discover gene functions related to chromatin states in mouse tumor T cells.
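The three-layer structure described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the inner statistic is taken to be a Fisher-type sum of -2 log p over the p-values in an interval, the intermediate and outer layers are both taken to be a minimum of empirical p-values, and the function names (`interval_stat`, `gcp_stat`, `rpcmp_sketch`) are hypothetical. The null p-value matrix would normally come from permuting the data; the demo simply draws independent uniform values.

```python
import numpy as np


def empirical_p(stats):
    """Empirical p-value of each entry among all entries (larger = more extreme)."""
    n = stats.shape[0]
    ordered = np.sort(stats)
    # Number of entries strictly smaller than each value.
    smaller = np.searchsorted(ordered, stats, side="left")
    return (n - smaller) / n


def interval_stat(pmat, lo, hi):
    """Inner layer (assumed form): sum of -2*log(p) over p-values in (lo, hi]."""
    p = np.clip(pmat, 1e-300, 1.0)  # guard against log(0)
    inside = (p > lo) & (p <= hi)
    return np.sum(np.where(inside, -2.0 * np.log(p), 0.0), axis=-1)


def gcp_stat(pmat, cuts):
    """Intermediate layer (assumed form): smallest per-interval empirical
    p-value over the intervals of one partition."""
    edges = [0.0, *cuts, 1.0]
    per_interval = [
        empirical_p(interval_stat(pmat, lo, hi))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return np.min(per_interval, axis=0)


def rpcmp_sketch(p_obs, p_null, partitions):
    """Outer layer (assumed form): minimum over partitions, calibrated
    against the same permutation rows. Row 0 is the observed data."""
    pmat = np.vstack([p_obs, p_null])
    # Smaller GCP statistics are more extreme, so negate before ranking.
    per_partition = [empirical_p(-gcp_stat(pmat, cuts)) for cuts in partitions]
    outer = np.min(per_partition, axis=0)
    return float(empirical_p(-outer)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n_perm = 20, 500
    p_null = rng.uniform(size=(n_perm, m))  # stand-in for permutation p-values
    p_obs = rng.uniform(size=m)
    p_obs[:10] = 1e-6                       # half of the genes carry signal
    partitions = [[0.05], [0.01, 0.1], [0.001, 0.05, 0.5]]
    print(rpcmp_sketch(p_obs, p_null, partitions))
```

In this sketch a single set of permutation rows is reused at every layer, so the nested minima are calibrated without running nested permutations. Whether rPCMP does the same is not stated in the abstract.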
