RAMOBoost: Ranked Minority Oversampling in Boosting

In recent years, learning from imbalanced data has attracted growing attention from both academia and industry, driven by the explosive growth of applications that produce and consume imbalanced data. Because of the complex characteristics of such data, however, many standard learning algorithms fail to deliver robust performance in these applications. To address this problem, this paper presents Ranked Minority Oversampling in Boosting (RAMOBoost), a ranked minority oversampling (RAMO) technique built on adaptive synthetic data generation within an ensemble learning system. Briefly, at each learning iteration RAMOBoost ranks minority-class instances according to a sampling probability distribution derived from the underlying data distribution, and adaptively shifts the decision boundary toward difficult-to-learn minority- and majority-class instances through a hypothesis assessment procedure. Simulations on 19 real-world data sets, evaluated with various metrics, including overall accuracy, precision, recall, F-measure, G-mean, and receiver operating characteristic (ROC) analysis, demonstrate the effectiveness of the method.
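
To make the mechanism concrete, the sketch below illustrates the ranked-oversampling idea inside a plain boosting loop. This is a minimal Python sketch, not the authors' reference implementation: the function and parameter names (ramo_sample, k1, k2, alpha, n_syn) are illustrative assumptions, the ranking here uses a sigmoid over each minority instance's majority-neighbor ratio, and the weight update is a standard AdaBoost.M1 rule; NumPy and scikit-learn are assumed to be available.

```python
# Minimal sketch of ranked minority oversampling inside a boosting loop.
# Illustrative only: ramo_sample, k1, k2, alpha, and n_syn are assumed
# names, not the paper's notation; the weight update is plain AdaBoost.M1.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def ramo_sample(X_min, X_all, y_all, n_syn, k1=10, k2=5, alpha=0.3, rng=None):
    """Draw n_syn synthetic minority samples (assumes >= k2 + 1 minority points).

    Each minority instance is ranked by the fraction of majority-class
    points among its k1 nearest neighbors in the full data set; new points
    are then SMOTE-style interpolations between a seed drawn from that
    ranking and one of the seed's k2 nearest minority neighbors.
    """
    rng = rng or np.random.default_rng(0)
    _, idx = NearestNeighbors(n_neighbors=k1 + 1).fit(X_all).kneighbors(X_min)
    delta = (y_all[idx[:, 1:]] == 0).mean(axis=1)      # majority-neighbor ratio
    prob = 1.0 / (1.0 + np.exp(-alpha * delta))        # soft ranking
    prob /= prob.sum()                                 # sampling distribution
    seeds = rng.choice(len(X_min), size=n_syn, p=prob)
    _, min_idx = NearestNeighbors(n_neighbors=k2 + 1).fit(X_min).kneighbors(X_min[seeds])
    mates = min_idx[np.arange(n_syn), rng.integers(1, k2 + 1, size=n_syn)]
    gap = rng.random((n_syn, 1))                       # interpolation factor in [0, 1)
    return X_min[seeds] + gap * (X_min[mates] - X_min[seeds])


def ramoboost_fit(X, y, T=20, n_syn=100):
    """AdaBoost.M1 loop with ranked oversampling each round (y in {0, 1}, minority = 1)."""
    w = np.full(len(y), 1.0 / len(y))
    learners, alphas = [], []
    for _ in range(T):
        X_syn = ramo_sample(X[y == 1], X, y, n_syn)
        X_t = np.vstack([X, X_syn])
        y_t = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
        # Synthetic points inherit the mean weight of the real minority points.
        w_t = np.concatenate([w, np.full(len(X_syn), w[y == 1].mean())])
        h = DecisionTreeClassifier(max_depth=1).fit(X_t, y_t, sample_weight=w_t)
        pred = h.predict(X)
        err = w[pred != y].sum()                       # error on the original data only
        if err == 0 or err >= 0.5:
            break
        beta = err / (1.0 - err)
        w *= np.where(pred == y, beta, 1.0)            # downweight correctly classified
        w /= w.sum()
        learners.append(h)
        alphas.append(np.log(1.0 / beta))
    return learners, alphas


def ramoboost_predict(X, learners, alphas):
    # Weighted majority vote over the ensemble, mapping labels {0, 1} to {-1, +1}.
    votes = sum(a * (2 * h.predict(X) - 1) for h, a in zip(learners, alphas))
    return (votes > 0).astype(int)
```

The design choice mirrored here is that synthetic points are drawn preferentially near minority instances surrounded by majority neighbors, so each boosting round concentrates new data along the hard-to-learn part of the decision boundary rather than oversampling the minority class uniformly.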
