Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification

The CBPLR showed superior results in terms of area under the curve (AUC) and misclassification rate. In terms of the number of selected genes, the CBPLR outperformed APLR and LASSO. The CBPLR performed remarkably well in the stability test. The classification accuracy of the CBPLR method is consistent and high.

An important application of DNA microarray data is cancer classification. Because microarray data are high-dimensional, gene selection approaches are often employed to support expert systems in diagnosing cancer with high classification accuracy. Penalized logistic regression using the least absolute shrinkage and selection operator (LASSO) is one of the key methods in high-dimensional cancer classification, as it performs gene coefficient estimation and gene selection simultaneously. However, the LASSO has been criticized for being biased in gene selection. The adaptive LASSO (APLR) was originally proposed to overcome this selection bias by assigning a consistent weight to each gene. In high-dimensional data, however, the adaptive LASSO faces a practical problem in choosing the type of initial weight. In practice, the LASSO estimator itself has been used as the initial weight, but this may not be preferable because the LASSO estimator is itself inconsistent. To address this issue, an alternative initial weight in adaptive penalized logistic regression (CBPLR) is proposed. The effectiveness of the CBPLR is examined on three well-known high-dimensional cancer classification datasets using the number of selected genes, the area under the curve, and the misclassification rate. The experimental results show that the proposed CBPLR is efficient and feasible for cancer classification. Additionally, the proposed weight is compared with APLR and LASSO and exhibits competitive performance in both classification accuracy and gene selection.
The proposed CBPLR improves on penalized logistic regression by selecting fewer genes while achieving a high area under the curve and a low misclassification rate. Thus, the proposed weight could conceivably be used in other research that implements gene selection in high-dimensional cancer classification.
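
As a rough illustration of the two-stage idea behind the adaptive LASSO described above, the following Python sketch uses the LASSO estimate itself as the initial weight, which is the common practice the abstract mentions. It does not implement the proposed CBPLR weight, whose definition is not given here. The function name `adaptive_lasso_logistic`, the parameters `C_init`, `C_final`, `gamma`, and `eps`, and the use of scikit-learn are all assumptions made for this example. The weighted L1 penalty is obtained by rescaling each gene column by its weight before an ordinary L1-penalized fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def adaptive_lasso_logistic(X, y, C_init=1.0, C_final=1.0, gamma=1.0, eps=1e-6):
    """Two-stage adaptive LASSO for logistic regression (illustrative sketch).

    All parameter names are hypothetical; C is scikit-learn's inverse
    regularization strength, so smaller C means a stronger penalty.
    """
    # Stage 1: an ordinary L1-penalized fit supplies the initial estimates.
    init = LogisticRegression(penalty="l1", solver="liblinear", C=C_init)
    init.fit(X, y)
    beta_init = init.coef_.ravel()

    # Adaptive weights w_j = 1 / (|beta_init_j| + eps)^gamma.
    # eps keeps the weight finite for genes the first stage set to zero.
    weights = 1.0 / (np.abs(beta_init) + eps) ** gamma

    # Stage 2: a weighted L1 penalty sum_j w_j * |beta_j| is equivalent to a
    # plain L1 penalty on the rescaled columns X_j / w_j.
    X_scaled = X / weights
    final = LogisticRegression(penalty="l1", solver="liblinear", C=C_final)
    final.fit(X_scaled, y)

    # Map the coefficients back to the original gene scale.
    beta = final.coef_.ravel() / weights
    return beta, beta_init


if __name__ == "__main__":
    # Synthetic stand-in for a microarray dataset: few samples, many genes.
    X, y = make_classification(
        n_samples=60, n_features=200, n_informative=5, random_state=0
    )
    beta, beta_init = adaptive_lasso_logistic(X, y)
    print("genes kept by LASSO:         ", np.count_nonzero(beta_init))
    print("genes kept by adaptive LASSO:", np.count_nonzero(beta))
```

Because genes dropped in the first stage receive very large weights, they are effectively excluded from the second stage. This is one reason the choice of initial weight matters in high-dimensional settings.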
