Estimation of Distribution Algorithms as Logistic Regression Regularizers of Microarray Classifiers

OBJECTIVES The "large k (genes), small N (samples)" phenomenon complicates microarray classification with logistic regression. The indeterminacy of the maximum likelihood solutions, multicollinearity among the predictor variables, and over-fitting of the data cause unstable parameter estimates. Moreover, the large number of predictor variables (genes) raises computational problems. Regularized logistic regression excels as a solution, but it brings its own difficulties: an objective function that is hard to optimize from a mathematical viewpoint and regularization parameters that require careful tuning.

METHODS These difficulties are tackled by introducing a new way of regularizing logistic regression. Estimation of distribution algorithms (EDAs), a class of evolutionary algorithms, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function with our EDA, without adding any penalty term; the difficulties that likelihood penalties introduce into the resulting optimization problems therefore vanish in our case. New estimates are simulated during the evolutionary process of the EDA in a way that guarantees their shrinkage while preserving the probabilistic dependence relationships learnt among them. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are the best markers for the classification.

RESULTS The consistency with the literature and the excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material.

CONCLUSIONS We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. Empirical results on microarray data sets yield sparse models with genes confirmed in the literature and better classification performance than competing regularized methods.
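The METHODS paragraph describes the core idea: a continuous EDA evolves logistic-regression coefficient vectors by maximizing the plain, unpenalized log-likelihood, and shrinkage is obtained implicitly through the way new candidates are simulated at each generation. The abstract does not spell out the exact simulation scheme, so the sketch below is only a minimal illustration under stated assumptions: it uses a univariate Gaussian EDA (UMDA_c) with truncation selection, and a simple factor pulling the learnt means toward zero stands in for the paper's shrinkage mechanism; all function names and parameter values are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): a univariate Gaussian
# EDA maximizing the unpenalized logistic log-likelihood, with an assumed
# shrink factor on the learnt means standing in for the implicit shrinkage
# described in the abstract.

import numpy as np

def log_likelihood(beta, X, y):
    """Unpenalized logistic log-likelihood for coefficient vector beta."""
    z = X @ beta
    # log p(y|x) = y*z - log(1 + exp(z)), computed stably with logaddexp
    return np.sum(y * z - np.logaddexp(0.0, z))

def eda_logistic(X, y, pop_size=200, n_select=60, n_gens=50,
                 shrink=0.9, seed=None):
    """Evolve logistic coefficients with a UMDA_c-style EDA (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.normal(0.0, 1.0, size=(pop_size, n_features))
    for _ in range(n_gens):
        fitness = np.array([log_likelihood(ind, X, y) for ind in pop])
        elite = pop[np.argsort(fitness)[-n_select:]]          # truncation selection
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8
        # Shrink the learnt means toward zero before sampling the next
        # generation: candidates stay small without penalizing the likelihood.
        pop = rng.normal(shrink * mu, sigma, size=(pop_size, n_features))
    fitness = np.array([log_likelihood(ind, X, y) for ind in pop])
    return pop[np.argmax(fitness)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 10))                 # toy "small N, larger k" data
    y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=40) > 0).astype(float)
    beta_hat = eda_logistic(X, y, seed=0)
    print(np.round(beta_hat, 2))
```

In the paper this evolutionary loop is further embedded in an adapted recursive feature elimination procedure that discards the genes with the least useful coefficients; that outer loop, and the intercept term, are omitted here for brevity.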
