Parallel selective sampling method for imbalanced and large data classification

We propose a new algorithm to preprocess huge and imbalanced data. The algorithm, based on distance calculations, reduces both data size and class imbalance. The selective sampling method was conceived for parallel and distributed computing. Combined with SVM, it yields optimized classification performance. Synthetic and real data sets were used to evaluate the classifiers' performance.

Several applications aim to identify rare events in very large data sets. Classification algorithms may face severe limitations on large data sets and show performance degradation due to class imbalance. Many solutions have been presented in the literature to deal with huge amounts of data or with class imbalance, but usually separately. In this paper we assess the performance of a novel method, Parallel Selective Sampling (PSS), which selects data from the majority class to reduce imbalance in large data sets. PSS was combined with Support Vector Machine (SVM) classification. PSS-SVM showed excellent performance on synthetic data sets, much better than SVM alone. Moreover, on real data sets PSS-SVM classifiers performed slightly better than SVM and RUSBoost classifiers, with reduced processing times, since the proposed strategy was conceived and designed for parallel and distributed computing. In conclusion, PSS-SVM is a valuable alternative to SVM and RUSBoost for the classification of huge and imbalanced data, owing to its accurate statistical predictions and low computational complexity.
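To make the overall idea concrete, the following is a minimal sketch of the general strategy described above: distance-based undersampling of the majority class followed by standard SVM training. It is not the authors' PSS implementation (which is parallel and distributed); the selection rule, function names, and parameters such as keep_ratio are illustrative assumptions only.

```python
# Illustrative sketch only: distance-based undersampling of the majority
# class, then SVM training on the reduced set. NOT the authors' PSS code;
# the selection rule and all parameters are assumptions for exposition.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def selective_undersample(X, y, majority_label, keep_ratio=0.2):
    """Keep the majority-class points closest to the minority class,
    assuming points near the class boundary carry most of the information."""
    maj_mask = (y == majority_label)
    X_maj, X_min = X[maj_mask], X[~maj_mask]

    # Distance from each majority point to its nearest minority neighbour.
    nn = NearestNeighbors(n_neighbors=1).fit(X_min)
    dist, _ = nn.kneighbors(X_maj)

    # Retain only the closest fraction of majority points.
    n_keep = max(1, int(keep_ratio * len(X_maj)))
    keep_idx = np.argsort(dist.ravel())[:n_keep]

    X_new = np.vstack([X_maj[keep_idx], X_min])
    y_new = np.concatenate([y[maj_mask][keep_idx], y[~maj_mask]])
    return X_new, y_new

# Usage: undersample the majority class, then train a standard SVM.
# X, y = ...            # features and binary labels, majority class = 0
# X_red, y_red = selective_undersample(X, y, majority_label=0)
# clf = SVC(kernel="rbf").fit(X_red, y_red)
```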
