Empirical null estimation using zero‐inflated discrete mixture distributions and its application to protein domain data

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, because the latter are limited in how well they can account for the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros of the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, in which mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
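
To make the ZIGP null concrete, the following is a minimal sketch rather than the article's implementation. It assumes the Consul-Jain parameterization of the Generalized Poisson distribution (rate `theta > 0`, dispersion `0 <= lam < 1`) with an extra point mass `pi0` at zero; the abstract does not state which parameterization is used, so these symbols and the function names are illustrative assumptions. The upper-tail probability shows how a count could be scored against an already fitted null before any FDR procedure is applied.

```python
import math


def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf (Consul-Jain form, an assumption here).

    P(X = x) = theta * (theta + lam*x)**(x-1) * exp(-theta - lam*x) / x!
    Valid for theta > 0 and 0 <= lam < 1; lam = 0 gives the Poisson pmf.
    """
    if x < 0:
        return 0.0
    log_p = (math.log(theta)
             + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x
             - math.lgamma(x + 1))
    return math.exp(log_p)


def zigp_pmf(x, pi0, theta, lam):
    """Zero-inflated GP: extra mass pi0 at zero, GP scaled by (1 - pi0)."""
    base = (1.0 - pi0) * gp_pmf(x, theta, lam)
    return pi0 + base if x == 0 else base


def upper_tail(x, pi0, theta, lam):
    """P(X >= x) under the ZIGP null, computed as 1 - P(X <= x - 1)."""
    lower = sum(zigp_pmf(k, pi0, theta, lam) for k in range(x))
    return max(0.0, 1.0 - lower)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    pi0, theta, lam = 0.3, 2.0, 0.3
    for count in (0, 5, 10, 20):
        print(count, upper_tail(count, pi0, theta, lam))
```

In this sketch, large counts receive small upper-tail probabilities under the null, which is the quantity an FDR procedure would then threshold. Estimating `pi0`, `theta`, and `lam` from counts below the cut-off value is a separate step that is not shown here.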
