A Cluster-Based Boosting Algorithm for Bankruptcy Prediction in a Highly Imbalanced Dataset

Bankruptcy prediction has in recent years been a popular and challenging research topic in both computer science and economics because of its importance to financial institutions, fund managers, lenders, governments, and other economic stakeholders. Bankruptcy datasets suffer from class imbalance: bankrupt companies are far fewer than normal companies, so standard classification algorithms perform poorly. This study therefore proposes a cluster-based boosting algorithm, CBoost, together with a robust framework using the CBoost algorithm and Instance Hardness Threshold (RFCI) for effective bankruptcy prediction on a financial dataset. The framework first resamples the imbalanced dataset by undersampling with Instance Hardness Threshold (IHT), which removes noisy majority-class instances that have large instance hardness values. CBoost then deals with the remaining class imbalance. In this algorithm, the majority class is partitioned into a number of clusters, and the distance from each sample to its closest centroid is used to initialize that sample's weight. The algorithm then runs several iterations, finding a weak classifier in each and combining them into a strong classifier. The resampled set produced by the IHT module is used to train CBoost, which then predicts bankruptcy for the validation set. The proposed framework is verified on the Korean bankruptcy dataset (KBD), which has a very small balancing ratio in both the training and the testing phases.
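The cluster-based weight initialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not name the clustering method or say how a distance is turned into a weight, so the use of k-means, the mapping 1 / (1 + d), the uniform starting weight for minority samples, the helper names `kmeans` and `init_weights`, and the toy data are all hypothetical. The IHT step and the boosting iterations are omitted.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[idx].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def init_weights(majority, minority, k=2):
    """Return one normalized weight per sample, majority first, then minority."""
    centroids = kmeans(majority, k)
    # Distance from each majority sample to its closest centroid.
    dists = [min(math.dist(p, c) for c in centroids) for p in majority]
    # ASSUMPTION: a sample closer to its centroid gets a larger weight.
    raw_major = [1.0 / (1.0 + d) for d in dists]
    # ASSUMPTION: every minority sample starts with raw weight 1.
    raw = raw_major + [1.0] * len(minority)
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical toy data: six "normal" companies and one "bankrupt" company.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (2.5, 2.5)]
minority = [(9.0, 9.0)]
weights = init_weights(majority, minority, k=2)
```

Under these assumptions the weights form a probability distribution over the training samples, which is the form an AdaBoost-style loop expects as its starting point.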
The experimental results show that the proposed framework achieves an AUC (area under the ROC curve) of 86.8% and outperforms several methods for handling imbalanced data in bankruptcy prediction on the experimental dataset, including the GMBoost algorithm, an oversampling-based method using SMOTEENN, and a clustering-based undersampling method.
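For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen bankrupt company receives a higher score than a randomly chosen normal company. The sketch below computes it directly from that definition. The function name `auc` and the labels and scores are hypothetical toy values, not taken from the KBD experiments.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count pairs where the positive sample outranks the negative one.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = bankrupt, 0 = normal; scores are hypothetical model outputs.
labels = [1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4]
result = auc(labels, scores)  # 7 of 8 pairs are ranked correctly -> 0.875
```

Because it depends only on how positives are ranked against negatives, AUC is not inflated by a classifier that simply predicts the majority class, which is why it suits highly imbalanced data better than plain accuracy.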
