A comprehensive survey of methods for overcoming the class imbalance problem in fraud detection

Fraud is a hugely costly criminal activity that occurs in a number of domains. Data mining has been applied to fraud detection in both a supervised and an unsupervised manner. When a supervised data mining approach is used, one of the biggest problems encountered is class imbalance. The class imbalance problem is not unique to the domain of fraud; it also occurs in fields as diverse as medical diagnosis and quality control. There are two basic means of overcoming the class imbalance problem: data methods and algorithmic methods. Data methods generally involve an under-sampling, over-sampling, or hybrid over/under-sampling approach. Other data methods investigated include SMOTE, which uses a k-NN learner to artificially synthesize minority-class samples. Algorithmic methods investigated include the use of misclassification costs, either through the MetaCost procedure or through metacost thresholds. Other algorithmic methods include the use of learners that are not sensitive to the class imbalance problem. The different methods for overcoming the class imbalance problem are implemented using open-source software. Three datasets are used to investigate the usefulness of the different methods. Two of the datasets are from the domain of fraud, while the third is from the domain of medical diagnosis.
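The data methods named above can be illustrated with a minimal sketch. It is not the implementation used in the survey: the function names (`random_undersample`, `random_oversample`, `smote_like`), the toy data, and the choice of squared Euclidean distance are all assumptions made for illustration. The last function follows the general SMOTE idea of interpolating between a minority sample and one of its k nearest minority neighbours, and it assumes at least two distinct minority samples.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep every minority row and an equally sized random subset of majority rows."""
    k = min(len(majority), len(minority))
    return rng.sample(majority, k), list(minority)


def random_oversample(majority, minority, rng):
    """Duplicate minority rows (sampling with replacement) until the classes match in size."""
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + extra


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows on line segments between minority neighbours.

    Illustrative only: brute-force neighbour search, numeric features only.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row (excluding the row itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sqdist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 20 "legitimate" rows and 4 "fraud" rows.
    legit = [(float(i), float(i % 5)) for i in range(20)]
    fraud = [(100.0, 100.0), (101.0, 100.0), (100.0, 101.0), (101.0, 101.0)]
    print(len(random_undersample(legit, fraud, rng)[0]))
    print(len(random_oversample(legit, fraud, rng)[1]))
    print(smote_like(fraud, n_new=3, k=2, rng=rng))
```

Under-sampling discards majority information, over-sampling repeats minority rows exactly, and the SMOTE-style step produces new rows that lie between existing minority rows; a hybrid approach would combine the first and last of these.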
