Exact fuzzy k-nearest neighbor classification for big datasets

The k-Nearest Neighbors (kNN) classifier is one of the most effective methods in supervised learning. It classifies unseen cases by comparing their similarity with the training data. However, it gives every labeled sample the same importance when classifying. Several approaches have been proposed to improve its precision, with the Fuzzy k-Nearest Neighbors (Fuzzy-kNN) classifier being among the most successful. Fuzzy-kNN computes a fuzzy degree of membership of each instance to the classes of the problem and, as a result, generates smoother borders between classes. Calculating this class membership, however, adds extra computational cost, which makes the method even less scalable on large datasets because of its memory needs and high runtime. Although a kNN approach for big datasets already exists, there is no fuzzy variant able to handle that volume of data. In this work, we present an exact, distributed approach based on Spark for running the Fuzzy-kNN classifier on big datasets, which provides the same precision as the original algorithm. It consists of two separate stages. The first stage transforms the training set by adding the class membership degrees. The second stage classifies the test set with the kNN algorithm, using the class memberships computed previously. In our experiments, we study the scaling-up capabilities of the proposed approach on datasets of up to 11 million instances, showing promising results.
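The two stages can be illustrated with a minimal single-machine sketch in plain Python. It is not the distributed Spark design proposed in this work. It assumes the membership and voting formulas of the original Fuzzy-kNN formulation by Keller et al. (a 0.51/0.49 split for the training memberships and inverse-distance weighting with fuzzifier m), Euclidean distance, and leave-one-out neighbor search in the first stage. The function names, the parameters k, m, and eps, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def k_nearest(query, points, k, skip=None):
    """Return the k (distance, index) pairs closest to `query` (Euclidean)."""
    dists = [
        (math.dist(query, p), i)
        for i, p in enumerate(points)
        if i != skip  # leave-one-out when searching inside the training set
    ]
    return sorted(dists)[:k]


def compute_memberships(X, y, n_classes, k=3):
    """Stage 1 (assumed Keller-style rule): attach class membership degrees
    to every training instance based on the labels of its k neighbors."""
    memberships = []
    for i, (x, label) in enumerate(zip(X, y)):
        counts = Counter(y[j] for _, j in k_nearest(x, X, k, skip=i))
        row = [0.49 * counts.get(c, 0) / k for c in range(n_classes)]
        row[label] += 0.51  # the instance's own class keeps the majority share
        memberships.append(row)
    return memberships


def classify(x, X, memberships, n_classes, k=3, m=2.0, eps=1e-12):
    """Stage 2: kNN search, then a distance-weighted vote over memberships."""
    neighbors = k_nearest(x, X, k)
    # eps is an illustrative guard against division by zero for exact matches
    weights = [1.0 / (d ** (2.0 / (m - 1.0)) + eps) for d, _ in neighbors]
    total = sum(weights)
    scores = [
        sum(w * memberships[j][c] for w, (_, j) in zip(weights, neighbors)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=scores.__getitem__)
    return best, scores


# Toy data: two well-separated groups of three points each.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]

memberships = compute_memberships(X, y, n_classes=2, k=3)
label, scores = classify((0.5, 0.5), X, memberships, n_classes=2, k=3)
print(label, scores)
```

Because the first stage depends only on the training set, its output can be computed once and reused for every test instance. That is the property that makes the two-stage split natural; a distributed version would have to partition both neighbor searches across the cluster.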
