kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data

The k-Nearest Neighbors classifier is a simple yet effective and widely renowned method in data mining. Applying this model directly in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data; however, their performance can be further improved with new designs that fit newly arising technologies. In this work we provide a new solution to perform exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large volumes of unseen cases against a big training dataset. The map phase computes the k nearest neighbors in different training data splits. Afterwards, multiple reducers determine the definitive neighbors from the candidate lists obtained in the map phase. The key point of this proposal lies in the management of the test set: it is kept in memory when possible, and otherwise split into a minimum number of chunks, applying one MapReduce job per chunk and using the caching capabilities of Spark to reuse the previously partitioned training set. In our experiments we study the differences between the Hadoop and Spark implementations with datasets of up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work, an open-source Spark package is available.
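The map/reduce decomposition described above can be illustrated with a minimal, single-machine sketch (plain Python rather than the actual Spark package; function names and the Euclidean metric are illustrative assumptions): each "map" computes the k nearest neighbors within one training split, and the "reduce" merges the per-split candidate lists into the definitive k neighbors before a majority vote.

```python
import heapq
import math

def local_knn(train_split, test_point, k):
    # Map phase (sketch): k nearest neighbors of a test point within one training split.
    dists = [(math.dist(x, test_point), label) for x, label in train_split]
    return heapq.nsmallest(k, dists)  # list of (distance, label) pairs

def merge_candidates(candidate_lists, k):
    # Reduce phase (sketch): merge per-split candidate lists into the global top-k.
    return heapq.nsmallest(k, (pair for cand in candidate_lists for pair in cand))

def classify(splits, test_point, k):
    candidates = [local_knn(s, test_point, k) for s in splits]  # one map per split
    neighbors = merge_candidates(candidates, k)                 # single reduce
    labels = [lbl for _, lbl in neighbors]
    return max(set(labels), key=labels.count)                   # majority vote
```

Because each split emits at most k candidates, the reducer only ever merges (number of splits) × k pairs, which is what makes the approach scale: the full training set is never gathered in one place.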
