Sparse logistic regression with an L1/2 penalty for gene selection in cancer classification

Background: Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve prediction accuracy. The regularization approach is one of the effective methods for gene selection in microarray data, which generally contain a large number of genes but only a small number of samples. In recent years, various approaches have been developed for gene selection from microarray data. Generally, they are divided into three categories: filter, wrapper, and embedded methods. Regularization methods are an important embedded technique, performing continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to gene selection. The most popular regularization technique is the Lasso (L1 penalty), and many L1-type regularization terms have been proposed in recent years. Theoretically, Lq-type regularization with a smaller value of q leads to sparser solutions. Moreover, L1/2 regularization can be taken as representative of the Lq (0 < q < 1) regularizations and has been shown to possess many attractive properties.

Results: In this work, we investigate sparse logistic regression with the L1/2 penalty for gene selection in cancer classification problems, and propose a coordinate descent algorithm with a new univariate half thresholding operator to solve the L1/2-penalized logistic regression. Experimental results on artificial and microarray data demonstrate the effectiveness of the proposed approach compared with other regularization methods. In particular, on four publicly available gene expression datasets, the L1/2 regularization method achieved its success using only about 2 to 14 predictors (genes), compared with about 6 to 38 genes for the ordinary L1 and elastic net regularization approaches.

Conclusions: Our evaluations show that sparse logistic regression with the L1/2 penalty achieves higher classification accuracy than the ordinary L1 and elastic net regularization approaches, while selecting fewer but more informative genes. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Therefore, sparse logistic regression with the L1/2 penalty is an effective technique for gene selection in real classification problems.
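To make the Results paragraph concrete, the following is a minimal sketch of the two computational ingredients it names: the univariate half thresholding operator (using the closed form published by Xu et al., 2012, for the L1/2 penalty) plugged into a coordinate descent loop over an IRLS quadratic approximation of the logistic log-likelihood, in the style of glmnet-type solvers. The function names (half_threshold, l12_logistic_cd), the penalty rescaling inside the inner loop, and all tuning details are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def half_threshold(t, lam):
    """Univariate half thresholding operator (closed form from Xu et al.,
    2012): returns argmin_b (b - t)**2 + lam * abs(b)**0.5.  Unlike soft
    thresholding, it jumps discontinuously from 0 at the threshold."""
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(t) <= thresh:
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(t) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * t * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))

def l12_logistic_cd(X, y, lam, n_outer=50, tol=1e-6):
    """Sketch of coordinate descent for L1/2-penalized logistic regression:
    an IRLS quadratic approximation of the log-likelihood with
    coordinate-wise half thresholding (details here are assumptions)."""
    n, p = X.shape
    beta = np.zeros(p)
    b0 = 0.0
    for _ in range(n_outer):
        eta = b0 + X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        w = np.clip(mu * (1.0 - mu), 1e-5, None)  # IRLS weights
        z = eta + (y - mu) / w                    # working response
        beta_old = beta.copy()
        b0 = np.average(z - X @ beta, weights=w)  # unpenalized intercept
        for j in range(p):
            # weighted partial residual with feature j's contribution removed
            r = z - b0 - X @ beta + X[:, j] * beta[j]
            d = np.sum(w * X[:, j] ** 2)
            t = np.sum(w * X[:, j] * r) / d
            # univariate subproblem (d/2)(b - t)^2 + lam*|b|^0.5 is
            # equivalent to (b - t)^2 + (2*lam/d)*|b|^0.5 after rescaling
            beta[j] = half_threshold(t, 2.0 * lam / d)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return b0, beta

# Tiny synthetic check: 3 informative features out of 200 ("genes").
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 200))
true_beta = np.zeros(200)
true_beta[:3] = [2.0, -1.5, 1.0]
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(80) < prob).astype(float)
b0, beta = l12_logistic_cd(X, y, lam=2.0)
print("selected features:", np.flatnonzero(beta))

The jump discontinuity of the half thresholding operator is what gives the L1/2 penalty its stronger sparsity than the Lasso's soft thresholding: small coefficients are set exactly to zero while surviving coefficients are shrunk less, which is consistent with the abstract's report of fewer selected genes at comparable accuracy.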
