UGRWO-Sampling: A modified random walk under-sampling approach based on graphs to imbalanced data classification

In this paper, we propose a new graph-based variant of RWO-Sampling (Random Walk Over-Sampling) for imbalanced datasets. The method builds two graphs, one used for over-sampling and one for under-sampling, that preserve proximity information and make the procedure robust to noise and outliers. The first graph is constructed on the minority class; RWO-Sampling is applied only to the samples selected from this graph, and the remaining minority samples are left unchanged. The second graph is constructed on the majority class; majority samples in high-density areas are kept, and those in low-density areas (outliers) are removed. Because RWO-Sampling is applied only to the selected samples, the boundary of the minority class is expanded without generating additional outliers. The method is evaluated on nine datasets with continuous attributes at different over-sampling rates and compared with previous methods on several evaluation measures. The experimental results indicate that the proposed method is efficient and flexible for the classification of imbalanced data.
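The two-graph procedure described above can be illustrated with a minimal Python sketch. The abstract does not specify how the graphs are built or how density is measured, so several choices here are assumptions made only for illustration: a directed k-nearest-neighbour graph, density measured by in-degree, the threshold `min_in_degree`, and all function names. The random-walk step follows the commonly cited RWO-Sampling form x'_j = x_j - (sigma_j / sqrt(n)) * r with r drawn from N(0, 1), using the population standard deviation; this too should be read as an assumption rather than the authors' exact formulation.

```python
import math
import random


def knn_in_degree(points, k):
    """In-degree of each point in a directed k-nearest-neighbour graph."""
    n = len(points)
    in_degree = [0] * n
    for i in range(n):
        # Indices of the other points, ordered by distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            in_degree[j] += 1
    return in_degree


def dense_indices(points, k, min_in_degree):
    """Indices whose in-degree reaches the (hypothetical) density threshold."""
    degrees = knn_in_degree(points, k)
    return [i for i, d in enumerate(degrees) if d >= min_in_degree]


def rwo_sample(minority, seeds, n_new, rng):
    """Generate n_new synthetic points by a random-walk step from the seeds."""
    if not seeds:
        return []
    n = len(minority)
    dims = len(minority[0])
    means = [sum(p[j] for p in minority) / n for j in range(dims)]
    # Population standard deviation per attribute (an assumption).
    sigmas = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in minority) / n)
              for j in range(dims)]
    synthetic = []
    for t in range(n_new):
        base = seeds[t % len(seeds)]  # cycle through the selected seeds
        synthetic.append(tuple(
            base[j] - sigmas[j] / math.sqrt(n) * rng.gauss(0.0, 1.0)
            for j in range(dims)))
    return synthetic


def resample(minority, majority, k=2, min_in_degree=1, n_new=4, seed=0):
    """Over-sample dense minority points; drop sparse majority points."""
    rng = random.Random(seed)
    seeds = [minority[i] for i in dense_indices(minority, k, min_in_degree)]
    # All original minority samples are kept; only dense ones act as seeds.
    new_minority = list(minority) + rwo_sample(minority, seeds, n_new, rng)
    new_majority = [majority[i]
                    for i in dense_indices(majority, k, min_in_degree)]
    return new_minority, new_majority


minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (20.0, 20.0)]
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
new_minority, new_majority = resample(minority, majority)
```

In this toy example the isolated points `(20.0, 20.0)` and `(10.0, 10.0)` are not among the two nearest neighbours of any other point, so they have in-degree zero: the minority outlier is retained but never used as a seed, and the majority outlier is dropped.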
