Improving the Performance of kNN in the MapReduce Framework Using Locality Sensitive Hashing

In this work the authors present a parallel k-nearest neighbor (kNN) algorithm that uses locality sensitive hashing (LSH) to preprocess the data before it is classified with kNN in Hadoop's MapReduce framework, and compare it with the sequential (conventional) implementation. Because LSH tends to place similar data objects in the same hash bucket, the iterative search needed to classify a data object is performed within a single bucket rather than over the whole data set, greatly reducing the computation time needed for classification. In several experiments, the parallel implementation outperformed the sequential implementation on very large datasets. The study also experimented with a few map-side and reduce-side optimization features for the parallel implementation and reports the best-performing parameter values found. On the map side, the block size and input split size were varied; on the reduce side, the number of planes was varied. The effect of each parameter was studied.
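The bucket-restricted search described above can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it assumes a random-hyperplane LSH scheme (suggested only by the abstract's mention of a "number of planes"), Euclidean distance, and majority voting, and the names `make_planes`, `bucket_key`, `build_buckets`, and `classify` are hypothetical. In a MapReduce setting, `bucket_key` would roughly correspond to the key emitted by a mapper and `classify` to work done per key in a reducer, but that mapping is also an assumption.

```python
import math
import random
from collections import Counter, defaultdict


def make_planes(num_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def bucket_key(planes, x):
    """Hash a point to a bit tuple: one bit per plane, set by the side it falls on."""
    return tuple(
        1 if sum(p_i * x_i for p_i, x_i in zip(plane, x)) >= 0 else 0
        for plane in planes
    )


def build_buckets(planes, points, labels):
    """Preprocessing step: group labeled training points by their hash key."""
    buckets = defaultdict(list)
    for point, label in zip(points, labels):
        buckets[bucket_key(planes, point)].append((point, label))
    return buckets


def classify(planes, buckets, query, k=3):
    """Run kNN only over the training points that share the query's bucket."""
    candidates = buckets.get(bucket_key(planes, query), [])
    if not candidates:
        return None  # empty bucket; a real system would need a fallback
    nearest = sorted(candidates, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    planes = make_planes(num_planes=4, dim=2)
    points = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0],
              [-1.0, -1.0], [-2.0, -2.0], [-4.0, -4.0]]
    labels = ["a", "a", "a", "b", "b", "b"]
    buckets = build_buckets(planes, points, labels)
    print(classify(planes, buckets, [0.5, 0.5]))
```

Increasing the number of planes yields more, smaller buckets: each query compares against fewer candidates, but true neighbors are more likely to land in a different bucket, which is the usual trade-off when tuning this parameter.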