Absent data generating classifier for imbalanced class sizes

We propose an algorithm for two-class classification problems with imbalanced training data, that is, when the number of training instances in one class is so low that conventional classification algorithms become ineffective at detecting the minority class. We present a modification of kernel Fisher discriminant analysis in which the imbalanced nature of the problem is explicitly addressed in the algorithm's formulation. The new algorithm exploits the properties of the existing minority points to learn the effect that additional minority points would have, had they actually existed. It proceeds iteratively, using the learned properties and conditional sampling to generate sufficient artificial data points for the minority set, thereby enhancing the detection probability of the minority class. Applying the proposed method to a number of simulated and real data sets, we show that it performs competitively against a set of alternative state-of-the-art imbalanced classification algorithms.
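
The abstract describes the method only at a high level: learn the properties of the existing minority points, iteratively generate artificial ("absent") minority points by conditional sampling, and train a kernel Fisher discriminant on the augmented data. The sketch below is a minimal illustration of that pipeline, not the paper's exact algorithm: a single Gaussian fitted to the minority class stands in for the paper's learned-property and conditional-sampling step, the discriminant follows the standard regularized kernel Fisher formulation with an RBF kernel, and all function names and parameters (`gamma`, `reg`, the jitter term) are illustrative assumptions.

```python
# Minimal sketch, NOT the paper's exact algorithm: a Gaussian fitted to the
# observed minority points stands in for the paper's learned-property /
# conditional-sampling step, and a standard regularized kernel Fisher
# discriminant is trained on the augmented data.  All names and parameters
# below are illustrative assumptions.
import numpy as np


def augment_minority(X_min, n_new, seed=None):
    """Draw n_new synthetic minority points from a Gaussian fitted to X_min."""
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)
    # Small jitter keeps the covariance invertible when X_min has few points.
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mu, cov, size=n_new)


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


def kernel_fda_fit(X, y, gamma=1.0, reg=1e-3):
    """Binary kernel Fisher discriminant (labels in {0, 1}, minority = 1)."""
    K = rbf_kernel(X, X, gamma)
    idx = [np.where(y == c)[0] for c in (0, 1)]
    # Kernelized class means: m_c[j] = mean over class c of k(x_j, x_i).
    m = [K[:, i].mean(axis=1) for i in idx]
    # Within-class scatter in the kernel-induced feature space.
    N = np.zeros_like(K)
    for i in idx:
        Kc = K[:, i]
        center = np.eye(len(i)) - np.full((len(i), len(i)), 1.0 / len(i))
        N += Kc @ center @ Kc.T
    alpha = np.linalg.solve(N + reg * np.eye(len(X)), m[1] - m[0])
    # Classify at the midpoint between the projected class means.
    threshold = 0.5 * (alpha @ m[0] + alpha @ m[1])
    return alpha, threshold


def kernel_fda_predict(alpha, threshold, X_train, X_test, gamma=1.0):
    """Label a test point 1 (minority) if its projection exceeds the threshold."""
    scores = rbf_kernel(X_test, X_train, gamma) @ alpha
    return (scores > threshold).astype(int)


# Hypothetical usage: oversample the minority class (label 1) until the two
# classes are roughly balanced, then fit and predict.
# n_new = (y == 0).sum() - (y == 1).sum()
# X_aug = np.vstack([X, augment_minority(X[y == 1], n_new)])
# y_aug = np.concatenate([y, np.ones(n_new, dtype=int)])
# alpha, thr = kernel_fda_fit(X_aug, y_aug, gamma=0.5)
# y_hat = kernel_fda_predict(alpha, thr, X_aug, X_test, gamma=0.5)
```

In the paper itself the artificial points are generated iteratively from the learned minority structure via conditional sampling rather than from a single fitted Gaussian; the sketch only conveys the overall oversample-then-discriminate pipeline.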
