An Improved AdaBoost Algorithm for Imbalanced Data Based on Weighted KNN

Imbalanced data have become an obstacle in data mining. The minority class is sometimes more important than the majority class, as in medical diagnosis and credit card fraud detection. This paper focuses on the problem that the AdaBoost algorithm cannot achieve adequate accuracy on the minority class of imbalanced data, and proposes an improved AdaBoost algorithm for imbalanced data based on weighted KNN (K-Adaboost). K-Adaboost uses the KNN algorithm to reduce the weights of majority-class samples that lie near the minority class, so that the classifier pays more attention to the minority class. In addition, the paper uses a new error function and sets a threshold during the classification process to avoid weight distortion.
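The weight-reduction idea can be illustrated with a minimal sketch. The abstract does not give the actual K-Adaboost update rule, error function, or threshold, so everything below is an assumption made for illustration: the function name `knn_downweight`, the `shrink` factor, and the rule of scaling a majority sample's weight by the fraction of minority samples among its k nearest neighbours are all hypothetical.

```python
import numpy as np

def knn_downweight(X, y, weights, k=2, shrink=0.5, minority_label=1):
    """Hypothetical sketch: shrink the weights of majority samples
    whose k nearest neighbours include minority samples.
    This is NOT the paper's rule, only one plausible reading of it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = np.asarray(weights, dtype=float).copy()

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbours of every sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Fraction of minority samples among each sample's neighbours.
    minority_frac = (y[neighbours] == minority_label).mean(axis=1)

    # Assumed rule: only majority samples are down-weighted, and more
    # strongly the more minority neighbours they have.
    is_majority = y != minority_label
    w[is_majority] *= 1.0 - shrink * minority_frac[is_majority]

    # Renormalise so the weights remain a distribution, as in AdaBoost.
    return w / w.sum()

# Toy data: one majority point next to two minority points,
# plus a distant majority cluster.
X = [[0.0, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
     [0.1, 0.0], [0.0, 0.1]]
y = [0, 0, 0, 0, 0, 1, 1]
w = knn_downweight(X, y, np.full(7, 1 / 7))
print(w)
```

In this toy run, the majority point at the origin loses weight relative to every other sample, while the distant majority cluster and the minority points keep equal weights. In a boosting loop, such a step would be applied to the sample distribution before fitting each weak learner; where exactly the paper applies it is not stated in the abstract.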
