A principal component analysis-based random forest with the potential nearest neighbor method for automobile insurance fraud identification

Abstract As a successful ensemble method, the Random Forest has attracted much attention. In this paper, individual classifiers are appropriately combined and a multiple classifier system with improved classification accuracy is presented. Following Breiman's methodology, we propose a multiple classifier system based on the Random Forest, Principal Component Analysis and Potential Nearest Neighbor methods. As Breiman suggested, the performance of a Random Forest depends on the strength of the weak learners in the forest and the diversity among them. Principal Component Analysis is applied to transform the data at each node into another space when computing the best split at that node. This process increases the diversity of each tree in the forest and thereby improves the overall accuracy. The Random Forest is then studied from the perspective of adaptive nearest neighbors: we introduce the concepts of monotone distance measures and potential nearest neighbors and show that the Random Forest can be viewed as an adaptive learning mechanism over k Potential Nearest Neighbors. To address the information loss caused by out-of-bag samples, a new voting mechanism based on Potential Nearest Neighbors is also presented to replace the traditional majority vote. The proposed algorithm improves the classification accuracy of the ensemble classifier by increasing the diversity of the base classifiers. The performance of the proposed method is compared with those of the Oblique Decision Tree Ensemble, Rotation Forest and the basic Random Forest on benchmark data sets. The experimental results show that the proposed method achieves higher classification accuracy and lower variance. The proposed method is also applied to detect automobile insurance fraud, and fraud rules are extracted.
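The per-node PCA rotation is the part of the pipeline concrete enough to sketch in code. The following is a minimal Python illustration of that idea only, not the authors' implementation: the names PCATree and PCARandomForest and all parameters are hypothetical, the split search is an exhaustive Gini scan, labels are assumed to be small non-negative integers, and the paper's Potential Nearest Neighbor vote is replaced here by a plain majority vote for brevity. It assumes numpy and scikit-learn are available.

import numpy as np
from sklearn.decomposition import PCA


def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(Z, y):
    """Return (feature, threshold) minimizing weighted Gini, or (None, None)."""
    best_j, best_t, best_score = None, None, gini(y)
    for j in range(Z.shape[1]):
        # Excluding the largest unique value guarantees both children are non-empty.
        for t in np.unique(Z[:, j])[:-1]:
            left = Z[:, j] <= t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t


class PCATree:
    """Decision tree that applies a node-local PCA rotation before splitting."""

    def __init__(self, max_depth=6, min_samples=5):
        self.max_depth = max_depth
        self.min_samples = min_samples

    def fit(self, X, y, depth=0):
        self.leaf_label = np.bincount(y).argmax()  # majority label fallback
        self.pca = None
        if depth >= self.max_depth or len(y) < self.min_samples or gini(y) == 0:
            return self
        pca = PCA().fit(X)      # rotation computed from this node's samples only
        Z = pca.transform(X)    # the split is searched in the rotated space
        j, t = best_split(Z, y)
        if j is None:
            return self
        self.pca, self.j, self.t = pca, j, t
        mask = Z[:, j] <= t
        self.left = PCATree(self.max_depth, self.min_samples).fit(X[mask], y[mask], depth + 1)
        self.right = PCATree(self.max_depth, self.min_samples).fit(X[~mask], y[~mask], depth + 1)
        return self

    def predict_one(self, x):
        if self.pca is None:
            return self.leaf_label
        z = self.pca.transform(x.reshape(1, -1))[0]
        child = self.left if z[self.j] <= self.t else self.right
        return child.predict_one(x)


class PCARandomForest:
    """Bagged ensemble of PCA-rotated trees with a plain majority vote."""

    def __init__(self, n_trees=25, **tree_kwargs):
        self.n_trees = n_trees
        self.tree_kwargs = tree_kwargs

    def fit(self, X, y, seed=0):
        rng = np.random.default_rng(seed)
        self.trees = []
        for _ in range(self.n_trees):
            idx = rng.integers(0, len(y), len(y))  # bootstrap sample
            self.trees.append(PCATree(**self.tree_kwargs).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.array([[t.predict_one(x) for x in X] for t in self.trees])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

Usage follows the familiar fit/predict pattern, e.g. PCARandomForest(n_trees=25).fit(X_train, y_train).predict(X_test). Because each node fits its own rotation from the samples reaching it, trees grown on different bootstrap samples split in different coordinate systems, which is the diversity mechanism the abstract describes; the PNN-based voting rule would replace the majority vote in predict.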
