Automatic filtering algorithm for imbalanced classification

Imbalanced data sets have been reported to hinder both the accuracy and the speed of many machine learning algorithms. Yet extremely imbalanced data sets (3∼5% positive samples) are common in many applications, such as multimedia semantic classification. In this paper, we propose a novel algorithm that automatically removes samples that have no effect or a negative effect on classifier training for imbalanced training data sets. With our algorithm, most easy-to-classify dominant-class samples in an imbalanced training set are eliminated automatically. As a result, the ratio of minority-class samples increases significantly, making the training set more suitable for classification algorithms. Experiments show that our algorithm preserves the classification accuracy of SVM while decreasing the training time dramatically.
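The abstract does not spell out the filtering rule, so the following is only an illustrative sketch of the general idea of discarding easy-to-classify dominant-class samples, not the authors' algorithm. It swaps in a simple nearest-neighbour heuristic as an assumption: a dominant-class sample is treated as "easy" when all of its k nearest neighbours also belong to the dominant class. The function name `filter_easy_majority` and the parameters `k` and `majority_label` are hypothetical.

```python
import math


def filter_easy_majority(X, y, k=3, majority_label=0):
    """Drop dominant-class samples whose k nearest neighbours are all dominant-class.

    ASSUMPTION: this k-NN rule is a stand-in heuristic, not the paper's method.
    X is a list of equal-length numeric tuples, y a list of class labels.
    Minority-class samples are always kept.
    """
    kept_X, kept_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            # Never discard minority-class samples.
            kept_X.append(xi)
            kept_y.append(yi)
            continue
        # Euclidean distance from sample i to every other sample, nearest first.
        dists = sorted(
            (math.dist(xi, xj), y[j]) for j, xj in enumerate(X) if j != i
        )
        neighbour_labels = [label for _, label in dists[:k]]
        # Keep the sample only if a minority sample is nearby, i.e. it
        # plausibly lies close to the class boundary.
        if any(label != majority_label for label in neighbour_labels):
            kept_X.append(xi)
            kept_y.append(yi)
    return kept_X, kept_y


# Toy data: a tight dominant-class cluster, plus one dominant-class point
# that sits right next to the single minority-class sample.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 0, 1]
X_filtered, y_filtered = filter_easy_majority(X, y, k=1)
```

On this toy set the four clustered dominant-class points are dropped and only the boundary point and the minority sample remain, so the minority ratio rises from 1/6 to 1/2. The brute-force neighbour search is quadratic in the number of samples and is meant for illustration only.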
