A Machine Learning based Approach to Reduce Behavioral Noise Problem in an Imbalanced Data: Application to a fraud detection

The problem of class imbalance has become more pronounced as learning algorithms are applied to real-world tasks, and it has received significant attention in the machine learning and data mining community. It arises in fraud detection, medical diagnostics, and other areas where the training data contains significantly more instances of one class (the majority class) than of the other (the minority class). Machine learning techniques struggle with imbalanced data because they tend to minimize the error rate for the majority class while neglecting the minority class, which is the more interesting class from a learning point of view and is costly to misclassify. However, the imbalance ratio is not the only cause of poor performance when learning from imbalanced data. Another critical factor that accompanies imbalanced data in the real world is that some instances of the two classes overlap in feature space. This problem is commonly referred to as class overlap; we call it "behavioral noise". In this paper, we propose the One Side Behavioral Noise Reduction (OSBNR) approach to deal with class imbalance in the presence of behavioral noise. OSBNR has two stages. First, clustering is applied to group similar instances of the minority class into multiple behavior clusters. Second, we select and eliminate the instances of the majority class that overlap with the behavior clusters of the minority class, since these are considered behavioral noise. The results of experiments conducted on a representative public dataset confirm that the proposed approach is effective for the class imbalance problem in the presence of behavioral noise.
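
The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract names neither the clustering algorithm nor the overlap criterion, so the sketch assumes a plain k-means with a deterministic farthest-first initialization, and it assumes that a majority instance "overlaps" a behavior cluster when it lies within that cluster's radius (the largest distance from the centroid to any of its minority members). The names `osbnr_sketch`, `kmeans`, `farthest_first_init`, and the parameters `k` and `minority_label` are hypothetical.

```python
import numpy as np


def farthest_first_init(X, k):
    """Deterministic init (assumption): start at X[0], then repeatedly add
    the point farthest from the centroids chosen so far."""
    centroids = [X[0]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        d = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    return np.array(centroids, dtype=float)


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, d.argmin(axis=1)


def osbnr_sketch(X, y, minority_label=1, k=2):
    """Return (X_clean, y_clean, noise_mask).

    Stage 1: cluster the minority class into k behavior clusters.
    Stage 2: flag majority instances lying inside any cluster radius
             (assumed overlap criterion) and drop them.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_minority = y == minority_label
    X_min = X[is_minority]

    # Stage 1: behavior clusters of the minority class.
    centroids, labels = kmeans(X_min, k)

    # Radius of each cluster; an empty cluster gets -1 so nothing falls inside.
    radii = np.full(k, -1.0)
    for j in range(k):
        members = X_min[labels == j]
        if len(members) > 0:
            radii[j] = np.linalg.norm(members - centroids[j], axis=1).max()

    # Stage 2: majority instances inside any minority cluster are noise.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    inside_any = (d <= radii[None, :]).any(axis=1)
    noise_mask = inside_any & ~is_minority

    keep = ~noise_mask
    return X[keep], y[keep], noise_mask
```

Calling `osbnr_sketch(X, y)` on a feature matrix `X` and label vector `y` returns the cleaned data together with a boolean mask marking the removed majority instances. Only the majority class is edited, which matches the "one side" idea in the abstract; the radius-based criterion and the choice of `k` are design assumptions that would need tuning on real data.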
