kdANN+: A Rapid AkNN Classifier for Big Data

A k-nearest neighbor (kNN) query determines the k points nearest to a given location under a chosen distance metric. An all k-nearest neighbor (AkNN) query is a variation of the kNN query that retrieves the k nearest points for each point in a database. Both are used mainly in spatial databases, and they form the backbone of many location-based applications, among others. In this work, we propose a novel method for classifying multidimensional data using an AkNN algorithm in the MapReduce framework. Our approach exploits space decomposition techniques to carry out the classification procedure in a parallel and distributed manner. To our knowledge, we are the first to study the kNN classification of multidimensional objects from this perspective. Through an extensive experimental evaluation, we show that our solution is efficient, robust and scalable in processing the given queries.
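The query types defined above can be illustrated with a minimal brute-force sketch. This is not the kdANN+ method: it uses no space decomposition and no MapReduce, and it only makes the definitions of kNN, AkNN and kNN classification concrete. The function names `knn`, `aknn` and `classify`, the choice of Euclidean distance, and the majority-vote rule are assumptions made for illustration.

```python
import math
from collections import Counter


def knn(points, query, k, skip=None):
    """Return the indices of the k points nearest to `query`.

    Assumes Euclidean distance; ties are broken by the lower index.
    `skip` optionally excludes one index (used by aknn below).
    """
    candidates = [i for i in range(len(points)) if i != skip]
    candidates.sort(key=lambda i: (math.dist(points[i], query), i))
    return candidates[:k]


def aknn(points, k):
    """For every point, return the indices of its k nearest other points."""
    return [knn(points, p, k, skip=i) for i, p in enumerate(points)]


def classify(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    votes = Counter(train_labels[i] for i in knn(train_points, query, k))
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]

neighbors = aknn(points, 2)                         # AkNN over the whole set
label = classify(points, labels, (0.5, 0.5), 3)     # kNN classification
```

The brute-force scan compares every point with every other point, so its cost grows quadratically with the size of the dataset; that cost is what motivates a partitioned, parallel approach for large inputs.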
