Model-free feature screening for categorical outcomes: Nonlinear effect detection and false discovery rate control

Feature screening has become a prerequisite for analyzing high-dimensional genomic data because it reduces dimensionality and removes redundant features. However, most existing screening methods assume linear effects and independence (or weak dependence) between features, assumptions that may not hold in practice. In this paper, we consider the problem of selecting continuous features for a categorical outcome from high-dimensional data. We propose a powerful statistical procedure that consists of two steps: a nonparametric significance test based on edge counts, and a multiple testing procedure with dependence adjustment for false discovery rate control. The new method has two novel aspects. First, the edge-count test directly targets distributional differences between groups and is therefore sensitive to nonlinear effects. Second, we relax the independence assumption and adapt Efron’s procedure to adjust for the dependence between features. We illustrate the performance of the proposed procedure, in terms of statistical power and false discovery rate, using simulated data. We then apply the new method to three genomic datasets to identify genes associated with colon, cervical, and prostate cancers.
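The two-step idea can be illustrated with a minimal Python sketch under stated assumptions; it is not the paper's implementation. It assumes each feature is tested on its own, so the minimum spanning tree reduces to a chain over the sorted values, and that there are no tied values. The permutation p-value and the plain Benjamini-Hochberg step are stand-ins: the abstract does not specify the null approximation, and the Efron-based dependence adjustment is not reproduced here. All function names (`edge_count_1d`, `edge_count_pvalue`, `benjamini_hochberg`) are hypothetical.

```python
import numpy as np


def edge_count_1d(x, y):
    """Count between-class edges in the 1-D minimum spanning tree.

    For a single continuous feature without ties, the MST is the chain
    that links consecutive sorted values, so the statistic is the number
    of adjacent pairs (in sorted order) whose class labels differ.
    """
    order = np.argsort(np.asarray(x, dtype=float), kind="stable")
    labels = np.asarray(y)[order]
    return int(np.sum(labels[1:] != labels[:-1]))


def edge_count_pvalue(x, y, n_perm=999, seed=None):
    """Permutation p-value; few between-class edges indicate a difference."""
    rng = np.random.default_rng(seed)
    order = np.argsort(np.asarray(x, dtype=float), kind="stable")
    labels = np.asarray(y)[order]
    observed = int(np.sum(labels[1:] != labels[:-1]))
    as_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        if np.sum(perm[1:] != perm[:-1]) <= observed:
            as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + as_extreme) / (n_perm + 1)


def benjamini_hochberg(pvals, q=0.1):
    """Standard BH step-up rule (assumption: used in place of the
    dependence-adjusted procedure described in the abstract)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))
        reject[order[: k + 1]] = True
    return reject


# Toy feature with equal group means but different spread (a nonlinear effect).
x_demo = np.concatenate([
    np.linspace(-0.1, 0.1, 20),   # class 0: tightly clustered
    np.linspace(-2.0, -1.0, 10),  # class 1: lower tail
    np.linspace(1.0, 2.0, 10),    # class 1: upper tail
])
y_demo = np.array([0] * 20 + [1] * 20)
p_demo = edge_count_pvalue(x_demo, y_demo, n_perm=999, seed=0)
```

In this toy feature the two groups share the same mean, so a mean-based test would have little to detect, whereas the edge count is small because the classes occupy different regions of the sorted values. Screening many features would apply `edge_count_pvalue` column by column and pass the resulting p-values to the multiple testing step.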
