Regularized receiver operating characteristic-based logistic regression for grouped variable selection with composite criterion

It is well known that statistical classifiers trained on imbalanced data tend to yield low true positive rates and inconsistent selection of significant variables. In this article, an improved method is proposed that enhances classification accuracy for the minority class by assigning a different misclassification cost to each group. The overall error rate is replaced by an alternative composite criterion. Furthermore, we propose an approach that estimates the tuning parameter, the composite criterion, and the cut-point simultaneously. Simulations show that the proposed method achieves a high true positive rate in prediction and performs well in variable selection for both continuous and categorical predictors, even with highly imbalanced data. An analysis of suboptimal health state data in traditional Chinese medicine is discussed as an illustrative application of the proposed method.
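The abstract does not define the composite criterion, so the following is only a minimal sketch of the general idea of choosing a cut-point by something other than the overall error rate. It assumes Youden's index, J = sensitivity + specificity - 1, as a stand-in criterion; the function name `youden_cut_point` and the toy scores are hypothetical and are not taken from the article.

```python
from typing import List, Tuple


def youden_cut_point(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (cut_point, J) maximizing J = TPR + TNR - 1.

    A case is predicted positive when its score is >= cut_point.
    Youden's index is an assumed stand-in for the article's composite criterion.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_cut, best_j = scores[0], -1.0
    # Every distinct score is a candidate cut-point.
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # strict inequality keeps the smallest cut among ties
            best_cut, best_j = cut, j
    return best_cut, best_j


# Hypothetical fitted probabilities for an imbalanced sample (3 positives, 7 negatives).
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

cut, j = youden_cut_point(scores, labels)
print(cut, round(j, 3))
```

On these toy scores the conventional cut-point of 0.5 would flag only one of the three minority cases, whereas the cut-point selected under the assumed criterion sits lower and recovers all three at the cost of one false positive.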
