A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data

DNA microarray data have been widely used in cancer research because they help to distinguish successfully between tumor classes. However, gene expression data are typically high-dimensional and class-imbalanced, which makes it difficult for traditional machine learning methods to construct a robust classifier that performs well on both the minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered particularly well suited to high-dimensional problems. Unfortunately, almost all Relief-based methods fail to take the imbalanced class distribution into account. This study shows that existing Relief-based algorithms may underestimate features that discriminate the minority classes, and may ignore the distributional characteristics of minority class samples. As a result, an additional bias towards the majority classes can be introduced. To address this, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief corrects the bias towards the majority classes and accounts for the scattered distribution of minority class samples when estimating feature weights. In this way, imRelief rewards features that separate the minority classes well from the other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection.
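For orientation, the following is a minimal sketch of the classic Relief weight update (one nearest hit and one nearest miss per sample), not of imRelief, whose update rule is not given in this abstract. The function name `relief_weights`, the Manhattan distance, and the min-max scaling are illustrative assumptions. Because every sample contributes equally to the sum, majority class samples dominate the weights, which is one plausible reading of the bias described above.

```python
import numpy as np

def relief_weights(X, y):
    """Classic Relief sketch (hypothetical helper, not imRelief).

    Assumes every class has at least two samples, so that each
    sample has a nearest hit as well as a nearest miss.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape

    # Scale each feature to [0, 1] so per-feature differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    Xs = (X - lo) / span

    w = np.zeros(d)
    for i in range(n):
        # Manhattan distance from sample i to every sample.
        dist = np.abs(Xs - Xs[i]).sum(axis=1)
        dist[i] = np.inf  # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # Reward features that differ on the miss, penalize those that differ on the hit.
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n  # every sample counts equally, regardless of class size
```

Calling `relief_weights(X, y)` on an array of shape (samples, genes) returns one weight per gene; larger weights indicate genes that separate nearest neighbors of different classes better than they separate nearest neighbors of the same class.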