Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms

MOTIVATION: Many methods have been proposed for sample (e.g. tumor) classification with microarray gene expression data. However, almost all of them ignore existing biological knowledge and treat all genes equally a priori. Yet previous studies have identified some genes as having biological functions, or as being involved in pathways, related to the outcome (e.g. cancer). Incorporating this type of prior knowledge into a classifier can potentially improve both the predictive performance and the interpretability of the resulting model.

RESULTS: We propose a simple and general framework for incorporating such prior knowledge into a penalized classifier. As two concrete examples, we apply the idea to two penalized classifiers: nearest shrunken centroids (also called PAM) and penalized partial least squares (PPLS). Instead of treating all genes equally a priori, as standard penalized methods do, we group the genes according to their functional associations, based on existing biological knowledge or data, and adopt group-specific penalty terms and penalization parameters. Simulated and real data examples demonstrate that, when the prior knowledge on gene grouping is indeed informative, the new methods outperform the two standard penalized methods, yielding higher predictive accuracy and screening out more irrelevant genes.
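
As a rough illustration of what a group-specific penalization parameter could look like in a nearest-shrunken-centroid classifier, the following is a minimal sketch, not the authors' implementation. It assumes plain soft-thresholding of each class centroid toward the overall centroid and omits the per-gene standardization used in PAM. The function names, the `groups` array (one group index per gene) and the `deltas` array (one threshold per group) are hypothetical and introduced only for this example.

```python
import numpy as np


def soft_threshold(d, delta):
    """Shrink each entry of d toward zero by delta; entries below delta become 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)


def group_shrunken_centroids(X, y, groups, deltas):
    """Class centroids shrunk toward the overall centroid, one threshold per gene group.

    X      : (n_samples, n_genes) expression matrix
    y      : (n_samples,) class labels
    groups : (n_genes,) integer group index of each gene (hypothetical input)
    deltas : (n_groups,) shrinkage threshold of each group (hypothetical input)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall = X.mean(axis=0)
    # Expand the group-level thresholds to one threshold per gene.
    delta_per_gene = np.asarray(deltas, dtype=float)[np.asarray(groups)]
    centroids = {}
    for k in np.unique(y):
        diff = X[y == k].mean(axis=0) - overall
        # Assumption: raw differences are thresholded; PAM would standardize them first.
        centroids[k] = overall + soft_threshold(diff, delta_per_gene)
    return centroids


def predict(X, centroids):
    """Assign each sample to the class with the nearest (squared Euclidean) shrunken centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.array(list(centroids))
    dists = np.stack([((X - centroids[k]) ** 2).sum(axis=1) for k in labels], axis=1)
    return labels[dists.argmin(axis=1)]
```

Under this sketch, a group of genes believed a priori to be relevant would receive a smaller entry in `deltas` than the remaining genes, so its members are less likely to be shrunk out of the classifier. How the thresholds are actually chosen (for example by cross-validation) is not specified here.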
