Weighted ReliefF with threshold constraints of feature selection for imbalanced data classification

Feature selection is a useful tool for data classification, since heterogeneous data and redundant features are common in the current era of data explosion. Commonly used feature selection algorithms, including but not limited to the Pearson correlation coefficient, the maximal information coefficient, and ReliefF, are well-posed under the assumption that instances are distributed homogeneously within a dataset. However, this assumption may not hold in practice. In the presence of data imbalance, these traditional algorithms can become ineffective because they are biased against the minority class, which contains few samples. This article aims to develop an effective feature selection algorithm for imbalanced judicial datasets that extracts essential features and discards negligible ones according to practical feature requirements. To this end, the number and the distribution of samples in each class are taken into account in the correlation analysis. Compared with traditional feature selection algorithms, the proposed improved ReliefF algorithm provides: (i) feature weights that differ according to the characteristics of the heterogeneous samples in each class; (ii) fair treatment of imbalanced datasets; and (iii) threshold constraints derived from practical feature requirements. Finally, experiments on a judicial dataset and six public datasets demonstrate the effectiveness and superiority of the proposed algorithm in improving classification accuracy on imbalanced datasets.
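To make the idea concrete, the following is a minimal sketch of a ReliefF-style scorer with class-balanced instance weights and a score threshold. It is not the algorithm proposed in the article, whose exact weighting scheme and threshold constraints are not given in this abstract. The function name `weighted_relieff`, the parameters `k` and `threshold`, and the inverse-class-frequency weighting are all assumptions chosen for illustration.

```python
import numpy as np

def weighted_relieff(X, y, k=3, threshold=0.0):
    """ReliefF-style feature scores with class-balanced instance weights.

    Illustrative sketch only: the inverse-class-frequency weighting and the
    simple score threshold are assumptions, not the article's method.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape

    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    # Assumed balancing: every class carries the same total weight, so a
    # minority instance counts more than a majority instance.
    inst_w = {c: 1.0 / (len(classes) * cnt) for c, cnt in zip(classes, counts)}

    scores = np.zeros(d)
    for i in range(n):
        diff = np.abs(Xs - Xs[i])      # per-feature differences to instance i
        dist = diff.sum(axis=1)        # Manhattan distance to instance i
        dist[i] = np.inf               # exclude the instance itself
        update = np.zeros(d)
        for c in classes:
            idx = np.flatnonzero(y == c)
            idx = idx[np.argsort(dist[idx])][:k]   # k nearest in class c
            idx = idx[np.isfinite(dist[idx])]      # drop self if it slipped in
            if idx.size == 0:
                continue
            mean_diff = diff[idx].mean(axis=0)
            if c == y[i]:
                update -= mean_diff    # nearest hits: penalize differences
            else:
                # Nearest misses, weighted by class prior as in standard ReliefF.
                update += prior[c] / (1.0 - prior[y[i]]) * mean_diff
        scores += inst_w[y[i]] * update

    # Assumed threshold rule: keep features whose score exceeds the threshold.
    selected = np.flatnonzero(scores > threshold)
    return scores, selected
```

Because the instance weights sum to one and each class receives the same total weight, the few minority samples influence the feature scores as much as the many majority samples. In the article the threshold reflects practical feature requirements; here it is a plain cutoff on the score.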
