Unbalanced data classification using extreme outlier elimination and sampling techniques for fraud detection

Detecting fraud in a highly overlapped and imbalanced fraud dataset is a challenging task. To address this problem, we propose a new approach that combines extreme outlier elimination with a hybrid sampling technique. The k reverse nearest neighbors (kRNN) concept is used as a data cleaning method to eliminate extreme outliers in minority regions. The hybrid sampling technique, which combines SMOTE to over-sample the minority data (fraud samples) with random under-sampling of the majority data (non-fraud samples), is used to improve fraud detection accuracy. The method was evaluated in terms of true positive rate and true negative rate on an insurance fraud dataset. We conducted experiments with four classifiers, namely C4.5, naive Bayes, k-NN, and Radial Basis Function networks, and compared the performance of our approach against a simple hybrid sampling technique. The results show that eliminating extreme outliers from the minority class yields high prediction rates for both the fraud and non-fraud classes.
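The pipeline described above can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation. The function names, the toy data, the value of k, and the rule that a minority point counts as an "extreme outlier" when no other minority point lists it among its k nearest neighbors are all assumptions made for illustration; the abstract does not state the exact threshold. The over-sampling step is a simplified SMOTE-style interpolation, not the full SMOTE algorithm.

```python
import math
import random

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]

def krnn_counts(points, k):
    """For each point, count how many other points have it among their k-NN."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            counts[j] += 1
    return counts

def drop_extreme_outliers(minority, k, min_count=1):
    """Keep minority points with at least min_count reverse neighbours.

    Assumption: min_count=1 is an illustrative threshold, not the paper's.
    """
    counts = krnn_counts(minority, k)
    return [p for p, c in zip(minority, counts) if c >= min_count]

def smote_like(minority, n_new, k, rng):
    """Simplified SMOTE: interpolate between a point and one of its k-NN."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn_indices(minority, i, k))
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic

def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, n_keep)

# Toy data (hypothetical): a tight fraud cluster plus one isolated fraud point.
rng = random.Random(0)
fraud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
legit = [(rng.uniform(2, 4), rng.uniform(2, 4)) for _ in range(40)]

cleaned = drop_extreme_outliers(fraud, k=2)            # removes (5.0, 5.0)
synthetic = smote_like(cleaned, n_new=6, k=2, rng=rng)
balanced_fraud = cleaned + synthetic
balanced_legit = random_undersample(legit, n_keep=10, rng=rng)
```

In this toy run the isolated point (5.0, 5.0) has no reverse neighbors within the fraud class, so it is dropped before over-sampling and no synthetic samples are interpolated toward it. Cleaning first is the design choice that matters here: interpolating toward an outlier would place synthetic fraud samples in regions dominated by the majority class.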
