A New K-Nearest Neighbors Classifier for Big Data Based on Efficient Data Pruning

The K-nearest neighbors (KNN) machine learning algorithm is a well-known non-parametric classification method. However, like other traditional data mining methods, applying it to big data poses computational challenges. KNN determines the class of a new sample from the classes of its nearest neighbors, but identifying those neighbors in a very large dataset is so computationally expensive that a single machine can no longer handle it. Pruning is one of the techniques proposed to make classification methods applicable to large datasets. LC-KNN is an improved KNN method that first clusters the data into smaller partitions using K-means and then, for each new sample, applies KNN only on the partition whose center is nearest. However, because the clusters have different shapes and densities, selecting the appropriate cluster is a challenge. This paper proposes an approach that improves the pruning phase of LC-KNN by taking these factors into account. The proposed approach selects a more appropriate cluster in which to search for neighbors, thereby increasing classification accuracy. Its performance is evaluated on several real datasets, and the experimental results show its effectiveness: higher classification accuracy and lower time cost than other recent, relevant methods.
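
To make the pruning idea concrete, here is a minimal Python sketch of the baseline LC-KNN scheme described above: partition the training data with K-means, fit a local KNN model per cluster, and route each query to the cluster whose center is nearest. All function names and parameters (fit_lc_knn, predict_lc_knn, n_clusters, k) are illustrative assumptions, not the paper's code; the paper's improved cluster-selection rule, which accounts for cluster shape and density, is not specified in the abstract and is therefore not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Sketch of the LC-KNN pruning idea (names are hypothetical, not the
# paper's implementation): K-means partitions the data, and each query
# is classified by KNN restricted to its nearest partition.

def fit_lc_knn(X_train, y_train, n_clusters=10, k=5, seed=0):
    """Cluster the training set and fit one local KNN model per cluster."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(X_train)
    local_models = []
    for c in range(n_clusters):
        mask = km.labels_ == c
        # A cluster may hold fewer than k points; shrink k accordingly.
        knn = KNeighborsClassifier(n_neighbors=min(k, int(mask.sum())))
        knn.fit(X_train[mask], y_train[mask])
        local_models.append(knn)
    return km, local_models

def predict_lc_knn(km, local_models, X_query):
    """Assign each query to the nearest centroid, then classify locally."""
    nearest = km.predict(X_query)  # index of the closest cluster center
    return np.array([
        local_models[c].predict(x.reshape(1, -1))[0]
        for x, c in zip(X_query, nearest)
    ])
```

The computational saving comes from the neighbor search touching only one partition of roughly n/n_clusters points instead of all n training samples; the paper's contribution targets the weak point of this scheme, namely that the nearest centroid is not always the best partition when clusters differ in shape and density.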
