Adaptive multinomial regression with overlapping groups for multi-class classification of lung cancer

Multi-class classification has attracted considerable attention in cancer diagnosis and treatment, and many machine learning methods have recently been developed to address it. However, classifying lung cancer data raises two problems: class imbalance and gene selection. In this paper, an adaptive multinomial regression with a sparse overlapping group lasso penalty is proposed to perform classification and grouped gene selection on lung cancer gene expression data. An overlapping grouping strategy with biological interpretability is proposed, which highlights the importance of gene groups from the minority classes. Conditional mutual information is used to evaluate the significance of each gene within its group and to construct data-driven weights. Based on the grouping strategy and the constructed weights, a regularized adaptive multinomial regression is presented and a solving algorithm is developed; the model not only selects the important gene groups for each class while performing multi-class classification, but also adaptively selects important genes within each group. The experimental results show that the proposed method significantly outperforms six other methods in classification accuracy, and that the selected genes are disease-causing genes for lung cancer.
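
The two ingredients named in the abstract, conditional mutual information and a sparse overlapping group lasso penalty, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not give the exact penalty, so the form below assumes a common sparse group lasso convention, lam * ((1 - alpha) * sum_g w_g * ||beta_g||_2 + alpha * sum_j v_j * |beta_j|), with overlapping groups handled by simply summing over groups. The function names, the plug-in estimator on discretized data, and the parameters `lam`, `alpha`, `group_weights`, and `gene_weights` are all hypothetical choices made for illustration.

```python
from collections import Counter
import math

import numpy as np


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete sequences.

    Assumption: variables are already discretized (for example, binned
    expression levels); the paper's estimator may differ.
    """
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), count in c_xyz.items():
        # p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], using raw counts
        ratio = (c_z[zi] * count) / (c_xz[(xi, zi)] * c_yz[(yi, zi)])
        total += (count / n) * math.log(ratio)
    return total


def sparse_overlapping_group_penalty(beta, groups, group_weights,
                                     gene_weights, lam=1.0, alpha=0.5):
    """Weighted sparse group lasso penalty for one class's coefficients.

    groups is a list of index lists that may share genes (overlap).
    Assumed form, not taken from the paper:
    lam * ((1 - alpha) * sum_g w_g ||beta_g||_2 + alpha * sum_j v_j |beta_j|)
    """
    beta = np.asarray(beta, dtype=float)
    group_term = sum(w * np.linalg.norm(beta[list(idx)])
                     for idx, w in zip(groups, group_weights))
    l1_term = float(np.sum(np.asarray(gene_weights, dtype=float) * np.abs(beta)))
    return lam * ((1 - alpha) * group_term + alpha * l1_term)


if __name__ == "__main__":
    # Gene 1 belongs to both groups, so it is penalized in two group norms.
    beta = [3.0, 4.0, 0.0]
    groups = [[0, 1], [1, 2]]
    print(sparse_overlapping_group_penalty(beta, groups, [1.0, 1.0],
                                           [1.0, 1.0, 1.0]))
```

In a weighting scheme of the kind the abstract describes, genes with higher conditional mutual information with the class label would presumably receive smaller `gene_weights`, so that they are shrunk less; that mapping is also an assumption here.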

[1]  Xin Yao,et al.  Resampling-Based Ensemble Methods for Online Class Imbalance Learning , 2015, IEEE Transactions on Knowledge and Data Engineering.

[2]  Yong Xu,et al.  RPCA-Based Tumor Classification Using Gene Expression Data , 2015, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3]  Juntao Li,et al.  Weighted doubly regularized support vector machine and its application to microarray classification with noise , 2016, Neurocomputing.

[4]  Hui Huang,et al.  Toward an optimal kernel extreme learning machine using a chaotic moth-flame optimization strategy with applications in medical diagnoses , 2017, Neurocomputing.

[5]  Zhen Liu,et al.  A hybrid method based on ensemble WELM for handling multi class imbalance in cancer microarray data , 2017, Neurocomputing.

[6]  Gavin Brown,et al.  Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection , 2012, J. Mach. Learn. Res..

[7]  S. Horvath,et al.  A General Framework for Weighted Gene Co-Expression Network Analysis , 2005, Statistical applications in genetics and molecular biology.

[8]  Keun Ho Ryu,et al.  Multiclass cancer classification using a feature subset-based ensemble from microRNA expression profiles , 2017, Comput. Biol. Medicine.

[9]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[10]  Steve Horvath,et al.  WGCNA: an R package for weighted correlation network analysis , 2008, BMC Bioinformatics.

[11]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[12]  Yong Fan,et al.  Feature selection by optimizing a lower bound of conditional mutual information , 2017, Inf. Sci..

[13]  Lin Zhao,et al.  Distributed adaptive fixed-time consensus tracking for second-order multi-agent systems using modified terminal sliding mode , 2017, Appl. Math. Comput..

[14]  Xuekun Song,et al.  Grouped gene selection and multi-classification of acute leukemia via new regularized multinomial regression. , 2018, Gene.

[15]  Muhammad Hisyam Lee,et al.  Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification , 2015, Comput. Biol. Medicine.

[16]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[17]  Noah Simon,et al.  A Sparse-Group Lasso , 2013 .

[18]  Rong Liu,et al.  Network-based approach to identify prognostic biomarkers for estrogen receptor–positive breast cancer treatment with tamoxifen , 2015, Cancer biology & therapy.

[19]  Xing-Ming Zhao,et al.  Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information , 2012, Bioinform..

[20]  Xiaoting Wang,et al.  Conditional mutual information and quantum steering , 2016, 1612.03875.

[21]  Kay Chen Tan,et al.  Evolutionary Cluster-Based Synthetic Oversampling Ensemble (ECO-Ensemble) for Imbalance Learning , 2017, IEEE Transactions on Cybernetics.

[22]  Christopher I. Amos,et al.  Gene set selection via LASSO penalized regression (SLPR) , 2017, Nucleic acids research.

[23]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[24]  Shannon L. Risacher,et al.  A novel SCCA approach via truncated ℓ1-norm and truncated group lasso for brain imaging genetics , 2017, Bioinform..

[25]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[26]  Hung-Wen Chiu,et al.  Cancer subtype prediction from a pathway-level perspective by using a support vector machine based on integrated gene expression and protein network , 2017, Comput. Methods Programs Biomed..

[27]  Niels Richard Hansen,et al.  Sparse group lasso and high dimensional multinomial classification , 2012, Comput. Stat. Data Anal..

[28]  Jill P. Mesirov,et al.  Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data , 2003, Machine Learning.

[29]  Antônio de Pádua Braga,et al.  Novel Cost-Sensitive Approach to Improve the Multilayer Perceptron Performance on Imbalanced Data , 2013, IEEE Transactions on Neural Networks and Learning Systems.

[30]  James J. Chen,et al.  Class-imbalanced classifiers for high-dimensional data , 2013, Briefings Bioinform..

[31]  Yanchun Liang,et al.  A resampling ensemble algorithm for classification of imbalance problems , 2014, Neurocomputing.

[32]  Lin Zhao,et al.  Adaptive Neural Consensus Tracking for Nonlinear Multiagent Systems Using Finite-Time Command Filtered Backstepping , 2018, IEEE Transactions on Systems, Man, and Cybernetics: Systems.

[33]  Safdar Ali,et al.  Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data , 2016, Comput. Biol. Medicine.

[34]  Shigeru Katagiri,et al.  Confusion-Matrix-Based Kernel Logistic Regression for Imbalanced Data Classification , 2017, IEEE Transactions on Knowledge and Data Engineering.

[35]  Akin Ozçift,et al.  Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis. , 2011, Computers in biology and medicine.

[36]  Yue-Shi Lee,et al.  Cluster-based under-sampling approaches for imbalanced data distributions , 2009, Expert Syst. Appl..