Simple instance selection for bankruptcy prediction

Instance selection, or outlier detection, is an important task in data mining that focuses on filtering out bad data from a given dataset. However, there is no rigid mathematical definition of what constitutes an outlier, and being an outlier is not a binary property. Consequently, different volumes of outliers may be detected depending on how the threshold for what constitutes an outlier is set, e.g., the distance threshold in distance-based outlier detection. In this study, we examine the bankruptcy prediction performance achieved after removing different volumes of outliers from four widely used datasets, namely the Australian, German, Japanese, and UC Competition datasets. Specifically, a simple distance-based clustering method is used for outlier detection. In addition, four popular classification techniques are compared: artificial neural networks, decision trees, logistic regression, and support vector machines. Experiments are conducted to examine (1) the prediction performance of the bankruptcy prediction models with and without instance selection, (2) the stability of the bankruptcy prediction models after outliers are removed from the testing set, and (3) the characteristics of the four datasets. The results show that it is much more difficult for the prediction models to achieve high accuracy on the German dataset after outlier removal, whereas it is easier on the UC Competition dataset. Removing 50% of the outliers can lead to optimal performance for the four models. In addition, when the removed outliers are used to test the prediction accuracy of these models, we find that support vector machines (SVM) provide the highest prediction accuracy and exhibit greater stability and better noise tolerance than the other three prediction models. Furthermore, the prediction accuracy of the SVM model with instance selection is similar to that of the SVM model without instance selection (i.e., the SVM baseline). In other words, the difference in performance between the SVM and the SVM baseline is smaller than that of the other three models compared with their corresponding baselines.
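The abstract describes the pipeline only at a high level. The sketch below illustrates the general idea under stated assumptions: k-means centroid distances serve as the outlier score, a synthetic dataset stands in for the bankruptcy datasets, and the cluster count and classifier settings are illustrative rather than those used in the study.

```python
# Minimal sketch of distance-based (clustering) outlier removal followed by a
# comparison of four classifiers. The centroid-distance score, k=5 clusters,
# and the synthetic data are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def remove_outliers(X, y, removal_rate, n_clusters=5, random_state=0):
    """Score each instance by its distance to its k-means centroid and drop
    the `removal_rate` fraction with the largest distances."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist.argsort()[: int(len(X) * (1.0 - removal_rate))]
    return X[keep], y[keep]

# Synthetic stand-in for a bankruptcy dataset (label = bankrupt / non-bankrupt).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
}

for rate in (0.0, 0.1, 0.3, 0.5):  # 0.0 is the no-selection baseline
    X_f, y_f = (X_tr, y_tr) if rate == 0.0 else remove_outliers(X_tr, y_tr, rate)
    scores = {name: clf.fit(X_f, y_f).score(X_te, y_te)
              for name, clf in models.items()}
    print(f"removal rate {rate:.0%}: " +
          ", ".join(f"{n}={s:.3f}" for n, s in scores.items()))
```

In the same spirit as the experiments described above, the removed instances could instead be held out and used as an additional noisy test set to probe each model's stability and noise tolerance.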
