Oversampling Techniques for Bankruptcy Prediction: Novel Features from a Transaction Dataset

In recent years, weakened by slowing economic growth, many enterprises have fallen into crisis because of financial difficulties. Bankruptcy prediction, typically implemented as a machine learning model, is of great utility to financial institutions, fund managers, lenders, governments, and other economic stakeholders. Because bankrupt companies are far fewer than non-bankrupt companies, bankruptcy prediction faces the problem of imbalanced data. This study first presents a bankruptcy prediction framework. Then, five oversampling techniques are used to deal with the imbalance problem on the experimental dataset, which was collected from Korean companies over the two years 2016 and 2017. Experimental results show that using oversampling techniques to balance the dataset in the training stage can enhance the performance of bankruptcy prediction. The best overall Area Under the Curve (AUC) of this framework reaches 84.2%. Next, the study extracts additional features by combining the financial dataset with a transaction dataset, which raises bankruptcy prediction performance to 84.4% AUC.
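The idea of balancing only the training data and then scoring with AUC can be illustrated with a minimal sketch. It uses plain random oversampling as a stand-in, which is not necessarily one of the five techniques the study evaluates, and the function names `random_oversample` and `auc` and all data values are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up training split: 1 = bankrupt (minority), 0 = non-bankrupt (majority).
train_rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
train_labels = [0, 0, 0, 0, 1]

# Balance the training split only; a held-out test split would stay untouched.
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
balanced_counts = Counter(bal_labels)

# Made-up test labels and model scores, used only to show the AUC computation.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Oversampling is applied to the training split alone so that the AUC measured on held-out data still reflects the original class imbalance.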
