Fast and Scalable Approaches to Accelerate the Fuzzy k-Nearest Neighbors Classifier for Big Data

One of the best-known and most effective methods in supervised classification is the k-nearest neighbors algorithm (kNN). Several approaches have been proposed to improve its accuracy, and fuzzy approaches are among the most successful, most notably the classical fuzzy k-nearest neighbors algorithm (FkNN). However, these traditional algorithms cannot cope with the large amounts of data that are available today. There are multiple alternatives that enable kNN classification on big datasets, in particular the approximate version of kNN known as the hybrid spill tree. Nevertheless, the existing FkNN proposals for big data problems are not fully scalable, because reproducing the behavior of the original FkNN algorithm requires a high computational load. This article proposes Global Approximate Hybrid Spill Tree FkNN and Local Hybrid Spill Tree FkNN, two approximate approaches that reduce runtime without degrading classification quality. The experiments compare several FkNN approaches for big data on datasets of up to 11 million instances. The results show improvements in both runtime and accuracy over algorithms from the literature.
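To make the baseline concrete, the following is a minimal sketch of the classical (single-node) fuzzy kNN rule of Keller et al. that the article builds on; it is not the distributed hybrid spill tree design proposed here, and the function name, parameters, and the use of NumPy are illustrative assumptions.

    # Minimal sketch, assuming the classical FkNN rule: class memberships of a query
    # are an inverse-distance-weighted average of its k nearest neighbors' memberships.
    import numpy as np

    def fuzzy_knn_predict(X_train, memberships, x_query, k=5, m=2.0, eps=1e-12):
        """memberships: (n_train, n_classes) membership degrees of each training
        instance (crisp one-hot labels are a valid special case)."""
        dists = np.linalg.norm(X_train - x_query, axis=1)      # Euclidean distances to all training points
        nn = np.argsort(dists)[:k]                             # indices of the k nearest neighbors
        w = 1.0 / (dists[nn] ** (2.0 / (m - 1.0)) + eps)       # inverse-distance weights, fuzzifier m
        class_memb = (memberships[nn] * w[:, None]).sum(axis=0) / w.sum()
        return class_memb.argmax(), class_memb                 # predicted class and its membership vector

The proposed big data variants replace the exhaustive neighbor search above with an approximate search over a distributed hybrid spill tree, which is what removes the main scalability bottleneck.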
