Simple instance selection for bankruptcy prediction

Instance selection, or outlier detection, is an important task in data mining that focuses on filtering out bad data from a given dataset. However, there is no rigid mathematical definition of what constitutes an outlier, and being an outlier is not a binary property. Consequently, different volumes of outliers may be detected depending on how the threshold for what constitutes an outlier is set, e.g., the distance threshold in distance-based outlier detection. In this study, we examine the bankruptcy prediction performance achieved after removing different volumes of outliers from four widely used datasets, namely the Australian, German, Japanese, and UC Competition datasets. Specifically, a simple distance-based clustering method is used for outlier detection. In addition, four popular classification techniques are compared: artificial neural networks, decision trees, logistic regression, and support vector machines. Experiments are conducted to examine (1) the prediction performance of the bankruptcy prediction models with and without instance selection, (2) the stability of the bankruptcy prediction models after outliers are removed from the testing set, and (3) the characteristics of the four datasets. The results show that it is much more difficult for the prediction models to achieve high accuracy on the German dataset after outlier removal, whereas it is easier on the UC Competition dataset. Removing 50% of the outliers can lead to optimal performance for the four models. In addition, when the removed outliers are used to test the prediction accuracy of these models, we find that support vector machines (SVM) provide the highest prediction accuracy and exhibit greater stability and better noise tolerance than the other three prediction models. Furthermore, the prediction accuracy of the SVM model with instance selection is similar to that of the SVM model without instance selection (i.e., the SVM baseline). In other words, the difference in performance between the SVM and the SVM baseline is smaller than that of the other three models compared with their corresponding baselines.
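The abstract describes the pipeline only at a high level. The sketch below illustrates the general idea under stated assumptions: k-means centroid distances serve as the outlier score, a synthetic dataset stands in for the bankruptcy datasets, and the cluster count and classifier settings are illustrative rather than those used in the study.

```python
# Minimal sketch of distance-based (clustering) outlier removal followed by a
# comparison of four classifiers. The centroid-distance score, k=5 clusters,
# and the synthetic data are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def remove_outliers(X, y, removal_rate, n_clusters=5, random_state=0):
    """Score each instance by its distance to its k-means centroid and drop
    the `removal_rate` fraction with the largest distances."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist.argsort()[: int(len(X) * (1.0 - removal_rate))]
    return X[keep], y[keep]

# Synthetic stand-in for a bankruptcy dataset (label = bankrupt / non-bankrupt).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
}

for rate in (0.0, 0.1, 0.3, 0.5):  # 0.0 is the no-selection baseline
    X_f, y_f = (X_tr, y_tr) if rate == 0.0 else remove_outliers(X_tr, y_tr, rate)
    scores = {name: clf.fit(X_f, y_f).score(X_te, y_te)
              for name, clf in models.items()}
    print(f"removal rate {rate:.0%}: " +
          ", ".join(f"{n}={s:.3f}" for n, s in scores.items()))
```

In the same spirit as the experiments described above, the removed instances could instead be held out and used as an additional noisy test set to probe each model's stability and noise tolerance.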
