Density-induced oversampling for highly imbalanced datasets

The problem of highly imbalanced datasets with only sparse data of the minority class in the context of two class classification is investigated. A novel synthetic data oversampling technique is proposed which utilizes estimations of the probability density distribution in the feature space. First, a Gaussian mixture model (GMM) from the data of the well-sampled majority class is generated and with its help a new GMM is approximated by Bayesian adaptation using the sparse minority class data. Random synthetic data is generated from the adapted GMM and an additional assignment rule assigns this data to either the minority class or else discards it. The obtained synthetic data is employed in combination with the available original data to train a support vector machine classifier. The examined application in this paper is optical on-line process monitoring of laser brazing with only rare sporadic occurring defects. Experiments with different amounts of minority class data samples and comparisons to other methods show that this approach performs very well for highly imbalanced datasets.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[3]  Michael Brady,et al.  Novelty detection for the identification of masses in mammograms , 1995 .

[4]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[5]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[6]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[7]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[8]  David M. J. Tax,et al.  One-class classification , 2001 .

[9]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[10]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[11]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[12]  Stephen Kwek,et al.  Applying Support Vector Machines to Imbalanced Datasets , 2004, ECML.

[13]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[14]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[15]  Gary M. Weiss,et al.  Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs? , 2007, DMIN.

[16]  Joydeep Ghosh,et al.  Generative Oversampling for Mining Imbalanced Datasets , 2007, DMIN.

[17]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[18]  Fritz Klocke,et al.  On-line quality control system for laser brazing , 2009 .

[19]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[20]  Michael Schmidt,et al.  Possibilities for online process monitoring at laser brazing based on two dimensional detector systems , 2009 .

[21]  Volker Märgner,et al.  In-situ quality monitoring during laser brazing , 2010 .

[22]  Tze-Yun Leong,et al.  A Model Driven Approach to Imbalanced Data Sampling in Medical Decision Making , 2010, MedInfo.

[23]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[24]  Volker Märgner,et al.  On-line Detection of Imperfections in Laser-brazed Joints , 2011, MVA.

[25]  Sheng Chen,et al.  Probability density function estimation based over-sampling for imbalanced two-class problems , 2012, The 2012 International Joint Conference on Neural Networks (IJCNN).