AdaOUBoost: adaptive over-sampling and under-sampling to boost the concept learning in large scale imbalanced data sets

Automatic concept learning from large-scale imbalanced data sets is a key issue in video semantic analysis and retrieval. Here, imbalance means that, for each concept, the negative examples in the training data far outnumber the positive examples. Existing methods generally under-sample the majority negative examples or over-sample the minority positive examples to balance the class distribution of the training data. These methods have two main drawbacks: (1) the degree of re-sampling, a key factor that greatly affects performance, must be fixed in advance in most existing methods, and a pre-fixed value is generally not the optimal choice; (2) many useful negative samples may be discarded in under-sampling. In addition, some works focus only on improving computational speed rather than accuracy. To address these issues, we propose a new approach and algorithm named AdaOUBoost (Adaptive Over-sampling and Under-sampling Boost). The novelty of AdaOUBoost lies mainly in adaptively over-sampling the minority positive examples and under-sampling the majority negative examples to form different sub-classifiers, and in combining these sub-classifiers according to their accuracy to create a strong classifier. The aim is to make full use of the whole training data and to improve the performance of the class-imbalance learning classifier. In AdaOUBoost, our clustering-based under-sampling method is first employed to divide the majority negative examples into disjoint subsets. Then, for each subset of negative examples, we use the borderline-SMOTE (synthetic minority over-sampling technique) algorithm to over-sample the positive examples to different sizes, train a sub-classifier on each resulting set, and obtain a classifier for the subset by fusing these sub-classifiers with different weights. Finally, we combine the classifiers from all subsets of negative examples to create a strong classifier.
We compare AdaOUBoost with state-of-the-art methods on the TRECVID 2008 benchmark with all 20 concepts, and the results show that AdaOUBoost achieves superior performance on large-scale imbalanced data sets.
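
The three stages described in the abstract (cluster the negatives into disjoint subsets, over-sample the positives to several sizes per subset, then fuse the sub-classifiers by accuracy) can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it is hypothetical. It makes several simplifying assumptions: a plain k-means stands in for the clustering-based under-sampling, simple SMOTE-style interpolation stands in for borderline-SMOTE, a nearest-centroid rule stands in for the sub-classifier, training accuracy is used as the fusion weight, and the two fusion levels are flattened into a single weighted vote.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def cluster_negatives(negatives, k, rng, iters=10):
    """Split the negatives into at most k disjoint subsets (plain k-means)."""
    centers = rng.sample(negatives, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in negatives:
            idx = min(range(k), key=lambda j: dist(p, centers[j]))
            groups[idx].append(p)
        # Keep the old center if a cluster happens to become empty.
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    return [g for g in groups if g]

def oversample(positives, n_new, rng):
    """Add n_new synthetic positives by interpolating toward the nearest
    other positive (SMOTE-style; assumed stand-in for borderline-SMOTE).
    Requires at least two positive examples."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(positives)
        q = min((o for o in positives if o is not p), key=lambda o: dist(p, o))
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return positives + synthetic

def train_centroid(pos, neg):
    """Nearest-centroid sub-classifier: returns +1 or -1 for a point."""
    cp, cn = mean(pos), mean(neg)
    return lambda x: 1 if dist(x, cp) <= dist(x, cn) else -1

def accuracy(clf, pos, neg):
    hits = sum(clf(x) == 1 for x in pos) + sum(clf(x) == -1 for x in neg)
    return hits / (len(pos) + len(neg))

def train_ensemble(positives, negatives, k=2, rates=(0.0, 1.0), seed=0):
    """One sub-classifier per (negative subset, over-sampling rate) pair,
    each weighted by its accuracy on the original training data."""
    rng = random.Random(seed)
    members = []  # list of (weight, classifier)
    for subset in cluster_negatives(negatives, k, rng):
        for rate in rates:
            pos = oversample(positives, int(rate * len(positives)), rng)
            clf = train_centroid(pos, subset)
            members.append((accuracy(clf, positives, negatives), clf))
    return members

def predict(members, x):
    """Accuracy-weighted vote over all sub-classifiers."""
    score = sum(w * clf(x) for w, clf in members)
    return 1 if score >= 0 else -1

positives = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
negatives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
             (0.0, 3.0), (1.0, 3.0), (0.0, 4.0)]
members = train_ensemble(positives, negatives)
print(predict(members, (10.5, 10.5)), predict(members, (0.5, 2.0)))
```

In this sketch the over-sampling rates are a fixed tuple. The adaptive element is limited to the weighting step, where sub-classifiers trained at different rates contribute in proportion to their measured accuracy instead of a single rate being chosen in advance.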
