An instance-based learning recommendation algorithm of imbalance handling methods

Abstract Imbalance learning is a typical problem in domain of machine learning and data mining. Aiming to solve this problem, researchers have proposed lots of the state-of-art techniques, such as Over Sampling, Under Sampling, SMOTE, Cost sensitive, and so on. However, the most appropriate methods on different learning problems are diverse. Given an imbalance learning problem, we proposed an Instance-based Learning (IBL) recommendation algorithm to present the most appropriate imbalance handling method for it. First, the meta knowledge database is created by the binary relation 〈data characteristic measures-the rank of all candidate imbalance handling methods〉 of each data set. Afterwards, when a new data set comes, its characteristics will be extracted and compared with the example in the knowledge database, where the instance-based k -nearest neighbors algorithm is applied to identify the rank of all candidate imbalance handling methods for the new dataset. Finally, the most appropriate imbalance handling method will be derived through combining the recommended rank and individual bias. The experimental results on 80 public binary imbalance datasets confirm that the proposed recommendation algorithm can effectively present the most appropriate imbalance handling method for a given imbalance learning problem, with the hit rate of recommendation up to 95%.

[1]  D. Tax,et al.  The characterization of classification problems by classifier disagreements , 2004, ICPR 2004.

[2]  Nitesh V. Chawla,et al.  Data Mining for Imbalanced Datasets: An Overview , 2005, The Data Mining and Knowledge Discovery Handbook.

[3]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, Sixth International Conference on Data Mining (ICDM'06).

[4]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[5]  Andrew W. Moore,et al.  Locally Weighted Learning for Control , 1997, Artificial Intelligence Review.

[6]  Tin Kam Ho,et al.  Complexity Measures of Supervised Classification Problems , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[8]  Vipin Kumar,et al.  Evaluating boosting algorithms to classify rare classes: comparison and improvements , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[9]  Vern Paxson,et al.  Outside the Closed World: On Using Machine Learning for Network Intrusion Detection , 2010, 2010 IEEE Symposium on Security and Privacy.

[10]  Jesús Alcalá-Fdez,et al.  KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework , 2011, J. Multiple Valued Log. Soft Comput..

[11]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[12]  Qinbao Song,et al.  Using Coding-Based Ensemble Learning to Improve Software Defect Prediction , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[13]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[14]  Yong Hu,et al.  The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature , 2011, Decis. Support Syst..

[15]  Gongping Yang,et al.  On the Class Imbalance Problem , 2008, 2008 Fourth International Conference on Natural Computation.

[16]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[17]  J. L. Hodges,et al.  Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties , 1989 .

[18]  David J. Spiegelhalter,et al.  Machine Learning, Neural and Statistical Classification , 2009 .

[19]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[20]  Panayiotis E. Pintelas,et al.  Mixture of Expert Agents for Handling Imbalanced Data Sets , 2003 .

[21]  Kate Smith-Miles,et al.  On learning algorithm selection for classification , 2006, Appl. Soft Comput..

[22]  João Gama,et al.  Characterizing the Applicability of Classification Algorithms Using Meta-Level Learning , 1994, ECML.

[23]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[24]  Hilan Bensusan,et al.  Discovering Task Neighbourhoods Through Landmark Learning Performances , 2000, PKDD.

[25]  Qinbao Song,et al.  Automatic recommendation of classification algorithms based on data set characteristics , 2012, Pattern Recognit..

[26]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[27]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[28]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[29]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[30]  Carlos Soares,et al.  A Comparison of Ranking Methods for Classification Algorithm Selection , 2000, ECML.

[31]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[32]  N. Japkowicz Learning from Imbalanced Data Sets: A Comparison of Various Strategies * , 2000 .

[33]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[34]  Abbas Akkasi,et al.  Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text , 2017, Applied Intelligence.

[35]  Carlos Soares,et al.  Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results , 2003, Machine Learning.

[36]  Rosa Maria Valdovinos,et al.  New Applications of Ensembles of Classifiers , 2003, Pattern Analysis & Applications.

[37]  Taghi M. Khoshgoftaar,et al.  Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction , 2010, 2010 22nd IEEE International Conference on Tools with Artificial Intelligence.