Instance Selection for Imbalanced Data

Imbalanced data exhibit an unequal distribution of the class labels. In a two-class imbalanced problem, elements of the majority class can vastly outnumber those of the minority class. Several standard learning methods are hindered by this skew in the training set and fail to recognize minority instances in the subsequent classification process. In real-world applications, e.g. in the medical domain or in fraud detection, the minority class is usually the class of interest. This motivates the development of techniques that overcome the challenges posed by data imbalance and improve classification performance. A considerable body of research has recently been devoted to this area (He & Garcia, 2009). One prominent family of solutions consists of the resampling methods, which balance the dataset by introducing additional minority elements (oversampling), removing certain majority elements (undersampling), or combining both (hybrid methods).
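To make the resampling idea concrete, the following is a minimal sketch of the two simplest baselines, random undersampling and random oversampling by duplication. It is an illustration only and not the instance selection approach studied in this work; the function names `random_undersample` and `random_oversample` and the toy data are assumptions introduced here. A hybrid method would apply both steps in combination.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop instances until every class is as small as the minority class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_size = min(counts.values())
    kept = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        kept.extend(rng.sample(idx, minority_size))
    kept.sort()  # preserve the original instance order
    return [X[i] for i in kept], [y[i] for i in kept]


def random_oversample(X, y, seed=0):
    """Randomly duplicate instances until every class is as large as the majority class."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        extra = [rng.choice(idx) for _ in range(majority_size - n)]
        X_out.extend(X[i] for i in extra)
        y_out.extend(label for _ in extra)
    return X_out, y_out


# Toy two-class problem (hypothetical data): 10 majority vs. 2 minority instances.
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2

Xu, yu = random_undersample(X, y)
Xo, yo = random_oversample(X, y)
print(Counter(yu))  # both classes reduced to the minority size
print(Counter(yo))  # both classes grown to the majority size
```

Random selection is used here only because it is the easiest baseline to state; informed methods replace the random choice with a criterion for which majority elements to remove or which minority elements to add.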

[1] E. B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference, 1927.

[2] Robert E. Schapire, et al. The strength of weak learnability, 1990, Mach. Learn.

[3] Lior Rokach, et al. Ensemble-based classifiers, 2010, Artificial Intelligence Review.

[4] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[5] Jesus A. Gonzalez, et al. Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic, 2006, FLAIRS.

[6] Tony R. Martinez, et al. Reduction Techniques for Instance-Based Learning Algorithms, 2000, Machine Learning.

[7] Haibo He, et al. Learning from Imbalanced Data, 2009, IEEE Transactions on Knowledge and Data Engineering.

[8] Francisco Herrera, et al. OWA-FRPS: A Prototype Selection Method Based on Ordered Weighted Average Fuzzy Rough Set Theory, 2013, RSFDGrC.

[9] Peter E. Hart, et al. Nearest neighbor pattern classification, 1967, IEEE Trans. Inf. Theory.

[10] Richard Nock, et al. Instance Pruning as an Information Preserving Problem, 2000, ICML.

[11] S. Holm. A Simple Sequentially Rejective Multiple Test Procedure, 1979.

[12] Dennis L. Wilson, et al. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data, 1972, IEEE Trans. Syst. Man Cybern.

[13] Shi Bing, et al. Inductive learning algorithms and representations for text categorization, 2006.

[14] Nitesh V. Chawla, et al. SMOTEBoost: Improving Prediction of the Minority Class in Boosting, 2003, PKDD.

[15] Francisco Herrera, et al. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, 2013, Inf. Sci.

[16] Fabrizio Angiulli, et al. Fast Nearest Neighbor Condensation for Large Data Sets Classification, 2007, IEEE Transactions on Knowledge and Data Engineering.

[17] Francisco Herrera, et al. A memetic algorithm for evolutionary prototype selection: A scaling up approach, 2008, Pattern Recognit.

[18] Chia-Cheng Liu, et al. Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm, 2002, Proceedings of the 2002 Congress on Evolutionary Computation (CEC'02).

[19] Nathalie Japkowicz, et al. The Class Imbalance Problem: Significance and Strategies, 2000.

[20] H. B. Mann, et al. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other, 1947.

[21] José Salvador Sánchez, et al. Decision boundary preserving prototype selection for nearest neighbor classification, 2005, Int. J. Pattern Recognit. Artif. Intell.

[22] Filiberto Pla, et al. Prototype selection for the nearest neighbour rule through proximity graphs, 1997, Pattern Recognit. Lett.

[23] Francisco Herrera, et al. Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, 2003, IEEE Trans. Evol. Comput.

[24] Mahendra Sahare, et al. A Review of Multi-Class Classification for Imbalanced Data, 2012.

[25] Taeho Jo, et al. Class imbalances versus small disjuncts, 2004, SKDD.

[26] Elena Marchiori, et al. Hit Miss Networks with Applications to Instance Selection, 2008, J. Mach. Learn. Res.

[27] Thomas M. Cover, et al. Estimation by the nearest neighbor rule, 1968, IEEE Trans. Inf. Theory.

[28] Lih-Yuan Deng, et al. Orthogonal Arrays: Theory and Applications, 1999, Technometrics.

[29] Kazuo Hattori, et al. A new edited k-nearest neighbor rule in the pattern classification problem, 2000, Pattern Recognit.

[30] Francisco Herrera, et al. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, 2013, Pattern Recognit.

[31] David W. Aha, et al. Instance-Based Learning Algorithms, 1991, Machine Learning.

[32] Yoav Freund, et al. Boosting a weak learning algorithm by majority, 1990, COLT '90.

[33] Mikel Galar, et al. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches, 2013, Knowl. Based Syst.

[34] Chi-Jen Lu, et al. Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies, 2006, J. Mach. Learn. Res.

[35] Yue-Shi Lee, et al. Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset, 2006.

[36] Francisco Herrera, et al. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, 2011, Swarm Evol. Comput.

[37] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[38] Gary M. Weiss. The Impact of Small Disjuncts on Classifier Learning, 2010, Data Mining.

[39] Szymon Wilk, et al. Learning from Imbalanced Data in Presence of Noisy and Borderline Examples, 2010, RSCTC.

[40] Taghi M. Khoshgoftaar, et al. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance, 2010, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[41] D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, 1975.

[42] David W. Aha, et al. Simplifying decision trees: A survey, 1997, The Knowledge Engineering Review.

[43] Jerzy W. Grzymala-Busse, et al. An Approach to Imbalanced Data Sets Based on Changing Rule Strength, 2004, Rough-Neural Computing: Techniques for Computing with Words.

[44] M. Friedman. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, 1937.

[45] Szymon Wilk, et al. Selective Pre-processing of Imbalanced Data for Improving Classification Performance, 2008, DaWaK.

[46] Gustavo E. A. P. A. Batista, et al. Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior, 2004, MICAI.

[47] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[48] José Salvador Sánchez, et al. On the k-NN performance in a challenging scenario of imbalance and overlapping, 2008, Pattern Analysis and Applications.

[49] Andrew K. C. Wong, et al. Classification of Imbalanced Data: a Review, 2009, Int. J. Pattern Recognit. Artif. Intell.

[50] Misha Denil, et al. Overlap versus Imbalance, 2010, Canadian Conference on AI.

[51] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1997, EuroCOLT.

[52] G. Gates, et al. The reduced nearest neighbor rule (Corresp.), 1972, IEEE Trans. Inf. Theory.

[53] Peter A. Flach, et al. Machine Learning - The Art and Science of Algorithms that Make Sense of Data, 2012.

[54] Johannes Fürnkranz, et al. Pruning Algorithms for Rule Learning, 1997, Machine Learning.

[55] Andreas Holzinger, et al. Data Mining with Decision Trees: Theory and Applications, 2015, Online Inf. Rev.

[56] Miguel Toro, et al. Finding representative patterns with ordered projections, 2003, Pattern Recognit.

[57] David J. Hand, et al. Choosing k for two-class nearest neighbour classifiers with unbalanced classes, 2003, Pattern Recognit. Lett.

[58] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[59] Stan Matwin, et al. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, 1997, ICML.

[60] Peter E. Hart, et al. The condensed nearest neighbor rule (Corresp.), 1968, IEEE Trans. Inf. Theory.

[61] José Hernández-Orallo, et al. Volume under the ROC Surface for Multi-class Problems, 2003, ECML.

[62] Francisco Herrera, et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory, 2012, Knowledge and Information Systems.

[63] Gary M. Weiss. Mining with Rare Cases, 2010, Data Mining and Knowledge Discovery Handbook.

[64] Yue-Shi Lee, et al. Cluster-based under-sampling approaches for imbalanced data distributions, 2009, Expert Syst. Appl.

[65] Gustavo E. A. P. A. Batista, et al. A study of the behavior of several methods for balancing machine learning training data, 2004, SKDD.

[66] Christopher J. C. Burges, et al. A Tutorial on Support Vector Machines for Pattern Recognition, 1998, Data Mining and Knowledge Discovery.

[67] Francisco Herrera, et al. Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68] J. Hanley, et al. The meaning and use of the area under a receiver operating characteristic (ROC) curve, 1982, Radiology.

[69] Nello Cristianini, et al. Controlling the Sensitivity of Support Vector Machines, 1999.

[70] Carla E. Brodley, et al. Addressing the Selective Superiority Problem: Automatic Algorithm/Model Class Selection, 1993.

[71] D. Dubois, et al. Rough Fuzzy Sets and Fuzzy Rough Sets, 1990.

[72] Hui Han, et al. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005, ICIC.

[73] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[74] Albert Y. Zomaya, et al. A particle swarm based hybrid system for imbalanced medical data sampling, 2009, BMC Genomics.

[75] Salvatore J. Stolfo, et al. Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection, 1998, KDD.

[76] I. Tomek, et al. Two Modifications of CNN, 1976.

[77] Kihoon Yoon, et al. An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics, 2005, Fifth International Conference on Hybrid Intelligent Systems (HIS'05).

[78] Dragos D. Margineantu, et al. Class Probability Estimation and Cost-Sensitive Classification Decisions, 2002, ECML.

[79] J. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, 1998.

[80] Nitesh V. Chawla, et al. Classification and knowledge discovery in protein databases, 2004, J. Biomed. Informatics.

[81] M. Narasimha Murty, et al. An incremental prototype set building technique, 2002, Pattern Recognit.

[82] Francisco Herrera, et al. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics, 2012, Expert Syst. Appl.

[83] Roberto Alejo, et al. Analysis of new techniques to obtain quality training sets, 2003, Pattern Recognit. Lett.

[84] Chumphol Bunkhumpornpat, et al. Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem, 2009, PAKDD.

[85] David J. Hand, et al. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems, 2001, Machine Learning.

[86] Tom Fawcett, et al. An introduction to ROC analysis, 2006, Pattern Recognit. Lett.

[87] Lotfi A. Zadeh, et al. Fuzzy Sets, 1996, Inf. Control.

[88] Jing Zhao, et al. ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, 2013, Neurocomputing.

[89] Chih-Jen Lin, et al. A comparison of methods for multiclass support vector machines, 2002, IEEE Trans. Neural Networks.

[90] Edward Y. Chang, et al. Class-Boundary Alignment for Imbalanced Dataset Learning, 2003.

[91] Larry Bull, et al. Mining breast cancer data with XCS, 2007, GECCO '07.

[92] L. B. Lusted, et al. Radiographic applications of receiver operating characteristic (ROC) curves, 1974, Radiology.

[93] José Francisco Martínez Trinidad, et al. A new fast prototype selection method based on clustering, 2010, Pattern Analysis and Applications.

[94] Yu-Lin He, et al. NRMCS: Noise removing based on the MCS, 2008, 2008 International Conference on Machine Learning and Cybernetics.

[95] M. Stone. Cross-Validatory Choice and Assessment of Statistical Predictions, 1976.

[96] Shuigeng Zhou, et al. C-pruner: an improved instance pruning algorithm, 2003, Proceedings of the 2003 International Conference on Machine Learning and Cybernetics.

[97] Stephen Kwek, et al. Applying Support Vector Machines to Imbalanced Datasets, 2004, ECML.

[98] G. Yule. On the Association of Attributes in Statistics: With Illustrations from the Material of the Childhood Society, &c, 1900.

[99] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[100] Kai Ming Ting, et al. An Instance-weighting Method to Induce Cost-sensitive Trees, 2001.

[101] Paul Jen-Hwa Hu, et al. A preclustering-based ensemble learning technique for acute appendicitis diagnoses, 2013, Artif. Intell. Medicine.

[102] Jorma Laurikkala, et al. Improving Identification of Difficult Small Classes by Balancing Class Distribution, 2001, AIME.

[103] David B. Skalak, et al. Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms, 1994, ICML.

[104] Chien-Hsing Chou, et al. The Generalized Condensed Nearest Neighbor Rule as A Data Reduction Method, 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[105] Ekrem Duman, et al. Comparing alternative classifiers for database marketing: The case of imbalanced datasets, 2012, Expert Syst. Appl.

[106] Marek Grochowski, et al. Comparison of Instances Selection Algorithms I. Algorithms Survey, 2004, ICAISC.

[107] Ronald R. Yager, et al. On ordered weighted averaging aggregation operators in multicriteria decisionmaking, 1988, IEEE Trans. Syst. Man Cybern.

[108] N. Graham, et al. Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation, 2002.

[109] Xin Yao, et al. MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning, 2014.

[110] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, J. Artif. Intell. Res.

[111] Nittaya Kerdprasop, et al. On the Generation of Accurate Predictive Model from Highly Imbalanced Data with Heuristics and Replication Techniques, 2012.

[112] Chris Mellish, et al. Advances in Instance Selection for Instance-Based Learning Algorithms, 2002, Data Mining and Knowledge Discovery.

[113] Filiberto Pla, et al. Using the Geometrical Distribution of Prototypes for Training Set Condensing, 2003, CAEPIA.

[114] Tan Yee Fan, et al. A Tutorial on Support Vector Machine, 2009.

[115] Francisco Herrera, et al. Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection, 2014, Appl. Soft Comput.

[116] Francisco Herrera, et al. FRPS: A Fuzzy Rough Prototype Selection method, 2013, Pattern Recognit.

[117] Xin Yao, et al. Diversity analysis on imbalanced data sets by using ensemble models, 2009, 2009 IEEE Symposium on Computational Intelligence and Data Mining.

[118] Nicolás García-Pedrajas, et al. A cooperative coevolutionary algorithm for instance selection for instance-based learning, 2010, Machine Learning.

[119] Filiberto Pla, et al. A Stochastic Approach to Wilson's Editing Algorithm, 2005, IbPRIA.

[120] Francisco Herrera, et al. Improving SMOTE with Fuzzy Rough Prototype Selection to Detect Noise in Imbalanced Classification Data, 2012, IBERAMIA.

[121] Robert Sabourin, et al. Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs, 2010, Pattern Recognit.

[122] Well-Trained PETs: Improving Probability Estimation, 2000.

[123] Nathalie Japkowicz, et al. The class imbalance problem: A systematic study, 2002, Intell. Data Anal.

[124] Jerzy W. Grzymala-Busse, et al. Rough Sets, 1995, Commun. ACM.

[125] R. M. Cameron-Jones, et al. Instance Selection by Encoding Length Heuristic with Random Mutation Hill Climbing, 1995.

[126] F. Wilcoxon. Individual Comparisons by Ranking Methods, 1945.