Regularized logistic regression without a penalty term: An application to cancer classification with microarray data

Research highlights
- EDAs can be used to find regularized logistic classifiers, which avoids having to determine the regularization term.
- The EDA is not affected by a large number of covariates.
- The method yields significantly better AUC than ridge and Lasso logistic regression.

Regularized logistic regression is a useful classification method for problems with few samples and a huge number of variables. Fitting it requires determining the regularization term, which amounts to searching for the optimal penalty parameter and choosing the norm applied to the regression coefficient vector. This paper presents a new regularized logistic regression method based on evolving the regression coefficients with estimation of distribution algorithms (EDAs). The main novelty is that it avoids the determination of the regularization term. The method chosen to simulate new coefficients at each step of the evolutionary process guarantees their shrinkage, which acts as an intrinsic regularization. Experimental results compare the behavior of the proposed method with Lasso and ridge logistic regression on three cancer classification problems with microarray data.
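To make the idea of evolving regression coefficients concrete, the following is a minimal sketch of a univariate Gaussian EDA that searches for logistic regression coefficients by maximizing the unpenalized log-likelihood. It is an illustration only, not the authors' implementation: the function names (`log_likelihood`, `eda_logistic`), the population sizes, the truncation selection, and the plain Gaussian resampling are all assumptions. In particular, the abstract does not describe the specific simulation scheme that guarantees shrinkage, so this sketch does not reproduce it.

```python
import numpy as np


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic model with coefficients beta."""
    z = X @ beta
    # log(1 + exp(z)) is computed stably with logaddexp.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))


def eda_logistic(X, y, pop_size=200, n_select=50, n_iter=60, seed=0):
    """Hypothetical univariate Gaussian EDA over logistic coefficients.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    # Start from independent normals centred at zero (an assumption).
    mean = np.zeros(n_features)
    std = np.ones(n_features)

    # Track the best individual seen so far, starting from beta = 0.
    best_beta = mean.copy()
    best_fit = log_likelihood(best_beta, X, y)

    for _ in range(n_iter):
        # Simulate a new population of coefficient vectors.
        pop = rng.normal(mean, std, size=(pop_size, n_features))
        fitness = np.array([log_likelihood(b, X, y) for b in pop])

        # Truncation selection: keep the n_select fittest individuals.
        order = np.argsort(fitness)[::-1]
        elite = pop[order[:n_select]]

        if fitness[order[0]] > best_fit:
            best_fit = float(fitness[order[0]])
            best_beta = pop[order[0]].copy()

        # Re-estimate the univariate Gaussian model from the elite.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # small floor avoids a degenerate model

    return best_beta, best_fit
```

Calling `eda_logistic(X, y)` with an `n x p` design matrix `X` and a 0/1 label vector `y` returns the best coefficient vector found and its log-likelihood. No penalty parameter appears anywhere in the search, which is the property the abstract emphasizes; how well such a sketch generalizes on real microarray data is not something this example establishes.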
