FIM-Based Pairwise Selection for Active Learning on Imbalanced Datasets

Active learning has proven effective at improving the efficiency of practical classification tasks. However, when applied to imbalanced datasets, traditional selection strategies for active learning can be severely disturbed by redundant majority-class instances and thus fail to deliver their usual benefit. In this paper, Fisher Information Matrix based pairwise selection (FIMPS) is proposed to address this failure of active learning on imbalanced datasets. FIMPS is derived from FIM-based selection by introducing a pairwise selection strategy adapted to the characteristics of imbalanced data distributions. During active learning, FIMPS is applied to an initial classifier, produced by an SVM variant designed for imbalanced problems, to extract a balanced set of instances from the unlabeled pool for updating the training set. The retraining process is thereby freed from the interference of class imbalance. FIMPS is compared with various selection approaches on a variety of real-world datasets. The results show that the proposed method outperforms the other strategies in enhancing the classification performance of active learning.
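
To make the idea of drawing a balanced batch from the unlabeled pool concrete, the following is a minimal sketch and not the FIMPS algorithm itself, whose details the abstract does not give. It rests on two assumptions: that "pairwise" means one instance is taken from each predicted class per pair, and that the scalar logistic Fisher-information weight p(1 - p) can stand in for the full FIM criterion. The function name `pairwise_select` and its parameters are hypothetical.

```python
import numpy as np


def pairwise_select(pos_prob, n_pairs):
    """Pick up to n_pairs (predicted-minority, predicted-majority) index pairs.

    pos_prob: estimated minority-class probability for each unlabeled instance,
              e.g. calibrated outputs of the initial classifier (assumption).
    """
    pos_prob = np.asarray(pos_prob, dtype=float)

    # Scalar Fisher-information weight of a logistic model: p * (1 - p).
    # Assumption: a simple stand-in for the matrix-based FIM criterion.
    info = pos_prob * (1.0 - pos_prob)

    # Split the pool by predicted class.
    pred_pos = np.flatnonzero(pos_prob >= 0.5)
    pred_neg = np.flatnonzero(pos_prob < 0.5)

    # Rank each predicted class separately, most informative first.
    pos_ranked = pred_pos[np.argsort(-info[pred_pos], kind="stable")]
    neg_ranked = pred_neg[np.argsort(-info[pred_neg], kind="stable")]

    # Taking one instance from each side per pair keeps the batch balanced.
    k = min(n_pairs, len(pos_ranked), len(neg_ranked))
    return [(int(i), int(j)) for i, j in zip(pos_ranked[:k], neg_ranked[:k])]
```

Because every pair contributes one instance from each predicted class, the selected batch is balanced with respect to the classifier's predictions no matter how skewed the pool is. Balance in the true labels still depends on how accurate the initial classifier is.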
