Feature selection for high dimensional imbalanced class data using harmony search

Misclassification costs of minority class data in real-world applications can be very high. This problem is especially challenging when the data is also high-dimensional, because high dimensionality increases overfitting and lowers model interpretability. Feature selection has recently become a popular way to address this problem by identifying the features that best predict a minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem, so as to select the best possible combination of features. The proposed algorithm is able to deal with situations where several features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms that were developed to address the same issue. Our empirical evaluation on different micro-array data sets using the G-Mean and AUC measures confirms that SYMON is comparable to or better than current benchmarks.
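As a rough illustration of the two ingredients named above, the following minimal sketch computes symmetrical uncertainty in its standard form, SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)), for discrete features, and runs a basic binary harmony search over feature masks. It is not the SYMON implementation: the function names, the `penalty`-based objective, and all parameter defaults are assumptions made for this example, and the two vector tuning operations mentioned above are not modelled.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature, labels):
    """Standard SU = 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so there is no dependency to measure
    h_xy = entropy(list(zip(feature, labels)))  # joint entropy H(X, Y)
    info_gain = h_x + h_y - h_xy
    return 2.0 * info_gain / (h_x + h_y)


def harmony_search(weights, penalty=0.5, memory_size=10, hmcr=0.9, par=0.3,
                   iterations=200, seed=0):
    """Basic binary harmony search over feature masks (illustrative only).

    Hypothetical objective: the sum of (weight - penalty) over selected
    features, so features weighted above `penalty` are worth selecting.
    """
    rng = random.Random(seed)
    n = len(weights)

    def score(mask):
        return sum(w - penalty for w, bit in zip(weights, mask) if bit)

    # Harmony memory: a pool of random candidate feature masks.
    memory = [[rng.randint(0, 1) for _ in range(n)] for _ in range(memory_size)]
    for _ in range(iterations):
        new = []
        for i in range(n):
            if rng.random() < hmcr:
                bit = rng.choice(memory)[i]  # memory consideration
                if rng.random() < par:
                    bit = 1 - bit            # pitch adjustment, here a bit flip
            else:
                bit = rng.randint(0, 1)      # random consideration
            new.append(bit)
        # Replace the worst stored harmony if the new one scores higher.
        worst = min(range(memory_size), key=lambda j: score(memory[j]))
        if score(new) > score(memory[worst]):
            memory[worst] = new
    return max(memory, key=score)
```

In this sketch, each feature column would first be weighted with `symmetrical_uncertainty(column, labels)`, and the resulting list of weights passed to `harmony_search` to obtain a 0/1 selection mask. A wrapper-style objective based on classifier performance (for example G-Mean or AUC) could replace the hypothetical `score` function.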
