Empirical Null Estimation using Discrete Mixture Distributions and its Application to Protein Domain Data

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, because the latter cannot fully account for the functional context that the position of a mutation provides. Testing every domain position poses a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This paper aims to select significant mutation counts while controlling Type I error at a given level through False Discovery Rate (FDR) procedures. A key assumption is that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, in which mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions.
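The idea of estimating a null from counts below a cut-off and then applying an FDR procedure can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's method: it swaps the mixture of discrete distributions for a single Poisson null, fits that null as a right-truncated Poisson to counts at or below a user-supplied cut-off, and applies the Benjamini-Hochberg step-up rule. The function names, the `cutoff` and `alpha` parameters, and the toy counts are all hypothetical.

```python
import math

def poisson_pmf(k, lam):
    # Poisson probability mass, computed in log space for stability.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def upper_tail(k, lam):
    # P(X >= k) for X ~ Poisson(lam); precision is limited for very small tails.
    return max(0.0, 1.0 - sum(poisson_pmf(j, lam) for j in range(k)))

def truncated_mean(lam, cutoff):
    # Mean of a Poisson(lam) variable conditioned on X <= cutoff.
    logw = [j * math.log(lam) - math.lgamma(j + 1) for j in range(cutoff + 1)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    return sum(j * wj for j, wj in enumerate(w)) / sum(w)

def fit_null_rate(counts, cutoff):
    # Assumption: counts <= cutoff come from the null. For a right-truncated
    # Poisson, the MLE matches the truncated mean to the sample mean, and the
    # truncated mean is increasing in lam, so bisection suffices.
    kept = [x for x in counts if x <= cutoff]
    target = sum(kept) / len(kept)
    lo, hi = 1e-8, 50.0 * (cutoff + 1)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if truncated_mean(mid, cutoff) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def benjamini_hochberg(pvals, alpha=0.05):
    # Step-up rule: reject the k smallest p-values, where k is the largest
    # rank with p_(k) <= alpha * k / m.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])

def select_significant(counts, cutoff, alpha=0.05):
    # Returns the fitted null rate and the indices of counts flagged by BH.
    lam = fit_null_rate(counts, cutoff)
    pvals = [upper_tail(x, lam) for x in counts]
    return lam, benjamini_hochberg(pvals, alpha)

# Toy data (hypothetical): mostly small counts plus three large ones.
toy_counts = [0] * 30 + [1] * 40 + [2] * 20 + [3] * 8 + [4] * 2 + [15, 20, 25]
null_rate, flagged = select_significant(toy_counts, cutoff=4, alpha=0.05)
print(round(null_rate, 3), flagged)
```

In this sketch the cut-off is fixed by hand; choosing it from the data, and replacing the single Poisson with a mixture of discrete distributions, are the parts the paper itself addresses.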
