Information gain directed genetic algorithm wrapper feature selection for credit rating

Abstract Financial credit scoring is one of the most crucial processes in the finance industry, used to assess the credit-worthiness of individuals and enterprises. Various statistics-based machine learning techniques have been employed for this task, but the "curse of dimensionality" remains a significant challenge for them. Some research has used a genetic algorithm (GA) as a wrapper for Feature Selection (FS) to improve the performance of credit scoring models. However, two challenges remain: finding an overall best method for credit scoring problems, and reducing the time taken by feature selection. In this study, the credit scoring problem is investigated through feature selection to improve classification performance. This work proposes a novel approach to feature selection in credit scoring applications, called the Information Gain Directed Feature Selection algorithm (IGDFS). It ranks features by information gain and then propagates the top m features through the GA wrapper (GAW) algorithm, using three classical machine learning algorithms for credit scoring: KNN, Naive Bayes (NB) and Support Vector Machine (SVM). The first stage, information gain guided feature selection, can help reduce the computing complexity of the GA wrapper, and the information gain of the features selected by IGDFS can indicate their importance to decision making.

Regarding classification accuracy, SVM is always more accurate than KNN and NB, whether used as a baseline classifier, within GAW or within IGDFS. We can also conclude that IGDFS achieved better performance than generic GAW, and that GAW obtained better performance than the corresponding single classifiers (baseline), in almost all cases. The exception is the German Credit dataset, where IGDFS + KNN performs worse than both generic GAW and the single classifier KNN. Removing features with low information gain could conflict with the original data structure that KNN relies on, and thus affect the performance of IGDFS + KNN. Regarding ROC performance, on the German Credit dataset the three classic machine learning algorithms (SVM, KNN and Naive Bayes) in the IGDFS GA wrapper obtained almost the same performance. On the Australian Credit dataset and the Taiwan Credit dataset, IGDFS + Naive Bayes achieved the largest area under the ROC curve.
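The two-stage idea described in the abstract (rank features by information gain, keep the top m, then let a GA wrapper search subsets of those m features) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the GA settings (population size, one-point crossover, bit-flip mutation, elitist truncation selection), seeding the population with the full top-m set, and the use of a leave-one-out 1-NN classifier with Hamming distance as the wrapped model are all assumptions made for illustration. Features are assumed to be discrete.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of a discrete feature: H(Y) minus the weighted entropy of Y per feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_m_features(X, y, m):
    """Stage 1: rank feature indices by information gain and keep the top m."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return ranked[:m], gains

def loo_1nn_accuracy(X, y, features):
    """Wrapped classifier (assumed stand-in): leave-one-out 1-NN, Hamming distance."""
    if not features:
        return 0.0
    correct = 0
    for i, row in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: sum(row[j] != X[k][j] for j in features))
        correct += y[nearest] == y[i]
    return correct / len(X)

def ga_wrapper(X, y, candidates, generations=20, pop_size=12, p_mut=0.1, seed=0):
    """Stage 2: GA search over bit masks of the candidate features (hypothetical settings)."""
    rng = random.Random(seed)

    def fitness(mask):
        return loo_1nn_accuracy(X, y, [f for f, bit in zip(candidates, mask) if bit])

    # Seed with the full candidate set so the result is never worse than using all of it.
    population = [[1] * len(candidates)]
    population += [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]          # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(candidates)) if len(candidates) > 1 else 0
            child = a[:cut] + b[cut:]                  # one-point crossover
            children.append([bit ^ (rng.random() < p_mut) for bit in child])  # bit-flip mutation
        population = parents + children
    best = max(population, key=fitness)
    return [f for f, bit in zip(candidates, best) if bit], fitness(best)

# Toy data (made up): feature 0 determines the label, features 1 and 2 are weak.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]]
y = [0, 0, 1, 1, 0, 1]
top, gains = top_m_features(X, y, m=2)
subset, accuracy = ga_wrapper(X, y, top)
print("top-m by information gain:", top)
print("GA-selected subset:", subset, "LOO accuracy:", accuracy)
```

Restricting the GA to the top m features shortens each chromosome from the full feature count to m bits, which is one way to read the abstract's claim that the first stage reduces the computing complexity of the wrapper. In practice the 1-NN stand-in would be replaced by the KNN, Naive Bayes or SVM classifier under study, with cross-validated accuracy as the fitness.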
