A hybrid under-sampling approach for mining unbalanced datasets: applications to banking and insurance

When solving unbalanced classification problems, machine learning algorithms tend to be overwhelmed by the majority class and consequently misclassify minority class observations. Here, we propose a hybrid under-sampling approach to improve the performance of classifiers. The proposed approach first employs the k-reverse nearest neighbour (kRNN) method to detect outliers in the majority class. After the outliers are removed, K-means clustering is applied and K clusters are selected to further reduce the influence of the majority class. We then employed support vector machine (SVM), logistic regression (LR), multi-layer perceptron (MLP), radial basis function network (RBF), group method of data handling (GMDH), genetic programming (GP) and decision tree (J48) classifiers. The effectiveness of the proposed approach was demonstrated on datasets taken from insurance fraud detection and credit card churn in the banking domain. Ten-fold cross-validation was used throughout the study. The proposed approach was observed to improve the performance of the classifiers.
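The two under-sampling stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how kRNN flags outliers or how the K clusters are turned into a reduced sample. Here a majority point is assumed to be an outlier when fewer than `min_count` other points list it among their k nearest neighbours, and the cleaned majority class is assumed to be replaced by the K cluster centroids. All function names, the `min_count` threshold, and the deterministic farthest-first initialisation are hypothetical choices made for the example.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]

def krnn_counts(points, k):
    """For each point, count how many other points have it in their k-NN list."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            counts[j] += 1
    return counts

def remove_outliers(points, k, min_count=1):
    """Assumed rule: drop points with fewer than min_count reverse neighbours."""
    counts = krnn_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_count]

def farthest_first_init(points, n_clusters):
    """Deterministic seeding (an assumption): start at points[0], then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd iterations; returns the final centroids as tuples."""
    centroids = farthest_first_init(points, n_clusters)
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def hybrid_undersample(majority, k, n_clusters):
    """Stage 1: kRNN outlier removal. Stage 2: K-means centroids as the
    reduced majority sample (assumed interpretation of the abstract)."""
    cleaned = remove_outliers(majority, k)
    return cleaned, kmeans(cleaned, n_clusters)

# Toy majority class: two tight groups plus one isolated point.
majority = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (100, 100)]
cleaned, reduced = hybrid_undersample(majority, k=2, n_clusters=2)
print(len(cleaned), reduced)
```

In this toy run the isolated point (100, 100) has no reverse neighbours and is dropped, and the eight remaining majority points are summarised by two centroids. In practice the reduced majority sample would be combined with the untouched minority class before training any of the classifiers listed above.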
