A Proximity Weighted Evidential k Nearest Neighbor Classifier for Imbalanced Data

In the k Nearest Neighbor (kNN) classifier, a query instance is classified according to the most frequent class among its nearest neighbors in the training set. On imbalanced datasets, kNN becomes biased towards the majority instances of the training space. To address this problem, we propose a method called the Proximity weighted Evidential kNN classifier. In this method, each neighbor of a query instance is treated as a piece of evidence, from which we calculate the probability of the class label given the feature values so as to give greater preference to minority instances. This evidence is then discounted by the proximity of the neighbor, prioritizing the closer instances in the local neighborhood. The resulting pieces of evidence are combined using the Dempster-Shafer theory of evidence. A rigorous experiment over 30 benchmark imbalanced datasets shows that our method outperforms 12 popular methods. In pairwise comparisons of these 12 methods against ours, our method wins on 29 datasets in the best case and on at least 19 datasets in the worst case. More importantly, according to the Friedman test, the proposed method ranks higher than all other methods in terms of AUC at the 5% level of significance.
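The pipeline the abstract describes — one mass function per neighbor, discounted by proximity, then fused with Dempster's rule — can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the paper's posterior estimate favoring minority classes is stood in for here by a simple inverse-class-frequency weight, and the Gaussian proximity discount `exp(-gamma * d^2)` and the parameters `alpha`, `gamma`, `k` are assumptions.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozensets to masses)
    with Dempster's rule, normalizing away the conflict."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                out[C] = out.get(C, 0.0) + a * b
            else:
                conflict += a * b
    return {C: v / (1.0 - conflict) for C, v in out.items()}

def evidential_knn_predict(X_train, y_train, x, k=5, gamma=1.0, alpha=0.95):
    classes = np.unique(y_train)
    frame = frozenset(classes)  # the full frame of discernment
    # Inverse class frequency: a stand-in for the paper's class-conditional
    # posterior, so minority-class neighbors carry stronger evidence.
    counts = {c: np.sum(y_train == c) for c in classes}
    inv_freq = {c: len(y_train) / (len(classes) * counts[c]) for c in classes}
    max_w = max(inv_freq.values())
    # Find the k nearest neighbors by Euclidean distance.
    d = np.linalg.norm(X_train - x, axis=1)
    m = {frame: 1.0}  # start from the vacuous (fully ignorant) belief
    for i in np.argsort(d)[:k]:
        c = y_train[i]
        # Proximity discount: closer neighbors yield stronger evidence.
        s = alpha * (inv_freq[c] / max_w) * np.exp(-gamma * d[i] ** 2)
        m = dempster_combine(m, {frozenset([c]): s, frame: 1.0 - s})
    # Decide by the largest singleton mass (mass on the frame is ignorance).
    return max(classes, key=lambda c: m.get(frozenset([c]), 0.0))
```

On a toy imbalanced set, a query lying near a lone minority instance is assigned the minority class even when `k` exceeds the number of minority neighbors, whereas a plain majority vote would return the majority class — this is the discounting effect the abstract describes.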
