On the Performance of Variable Selection and Classification via Rank-Based Classifier

In high-dimensional gene expression data analysis, accurate and reliable cancer classification and the selection of important genes are crucial. Various methods have been proposed in the literature to identify these important genes and predict future outcomes (tumor vs. non-tumor), but only a few of them take into account correlation patterns and grouping effects among the genes. In this article, we propose a rank-based modification of the popular penalized logistic regression procedure, based on a combination of ℓ1 and ℓ2 penalties, that is capable of handling possible correlation among genes in different groups. While the ℓ1 penalty maintains sparsity, the ℓ2 penalty induces smoothness based on information from the Laplacian matrix, which represents the correlation pattern among genes. We combine logistic regression with the BH-FDR (Benjamini and Hochberg false discovery rate) screening procedure and a newly developed rank-based selection method to arrive at an optimal model that retains the important genes. Through simulation studies and a real-world application to high-dimensional colon cancer gene expression data, we demonstrate that the proposed rank-based method outperforms currently popular methods such as the lasso, adaptive lasso, and elastic net in both gene selection and classification.
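As a rough illustration of two ingredients named above, the following Python sketch shows the standard Benjamini-Hochberg step-up rule for screening p-values and one common form of a combined ℓ1 plus Laplacian-weighted ℓ2 penalty. The abstract does not give the exact formulation, so the function names (`bh_reject`, `network_penalty`), the tuning parameters `lam1` and `lam2`, and the quadratic form beta' L beta are assumptions made for illustration, not the authors' implementation. The rank-based selection step is not sketched because the abstract does not describe it.

```python
import numpy as np


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()          # largest rank meeting its threshold
        reject[order[: k + 1]] = True           # reject every hypothesis up to that rank
    return reject


def network_penalty(beta, laplacian, lam1, lam2):
    """Assumed penalty form: lam1 * ||beta||_1 + lam2 * beta' L beta."""
    beta = np.asarray(beta, dtype=float)
    laplacian = np.asarray(laplacian, dtype=float)
    l1_term = np.abs(beta).sum()                # encourages sparsity
    smooth_term = beta @ laplacian @ beta       # penalizes differences between linked genes
    return lam1 * l1_term + lam2 * smooth_term
```

In a screening step of this kind, per-gene p-values (for example, from univariate logistic regressions) would be passed to `bh_reject`, and only the retained genes would enter the penalized fit. For a graph Laplacian, the quadratic term is small when connected genes have similar coefficients, which is how a penalty of this form can encourage smoothness across correlated genes.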
