A Grid-Based k-Nearest Neighbor Join for Large Scale Datasets on MapReduce

Because MapReduce supports efficient parallel data processing, MapReduce-based query processing algorithms have been widely studied. Among the various query types, the k-nearest neighbor (k-NN) join, which finds for each point of one dataset its k nearest neighbors in another dataset, is regarded as one of the most important in data analysis. Existing k-NN join algorithms on MapReduce suffer from high index construction and computation costs, which makes them unsuitable for big data processing. In this paper, we propose a new grid-based k-NN join query processing algorithm on MapReduce. First, we design a dynamic grid index that represents the distribution of the join datasets. Based on this index, we use distance-based filtering to prune cells that cannot contribute to the join result, which reduces both data transmission and computation overhead. Our performance analysis shows that the proposed algorithm outperforms the existing scheme by up to seven times in query processing time while achieving high query result accuracy.
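To make the idea of distance-based cell pruning concrete, the following is a minimal single-process sketch in Python. It is not the algorithm proposed in the paper: it assumes a uniform grid with a fixed, hypothetical `cell_size` rather than a dynamic grid index, works on 2-D points only, and omits the MapReduce partitioning entirely. The function names (`build_grid`, `min_dist_to_cell`, `knn_join`) are illustrative.

```python
import heapq
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Bucket 2-D points into square cells keyed by integer (col, row)."""
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    return grid


def min_dist_to_cell(q, cell, cell_size):
    """Lower bound on the distance from q to any point inside the cell."""
    dx = max(cell[0] * cell_size - q[0], 0.0, q[0] - (cell[0] + 1) * cell_size)
    dy = max(cell[1] * cell_size - q[1], 0.0, q[1] - (cell[1] + 1) * cell_size)
    return math.hypot(dx, dy)


def knn_join(r_points, s_points, k, cell_size):
    """For each point in r_points, return its k nearest points in s_points."""
    grid = build_grid(s_points, cell_size)  # assumption: uniform grid over S
    result = {}
    for q in r_points:
        # Visit non-empty cells in order of their lower-bound distance to q.
        cells = sorted(grid, key=lambda c: min_dist_to_cell(q, c, cell_size))
        best = []  # max-heap emulated with negated distances: (-dist, point)
        for cell in cells:
            bound = min_dist_to_cell(q, cell, cell_size)
            # Distance-based filter: once k candidates are held, a cell whose
            # lower bound exceeds the k-th distance (and every later cell)
            # cannot improve the result.
            if len(best) == k and bound > -best[0][0]:
                break
            for s in grid[cell]:
                d = math.dist(q, s)
                if len(best) < k:
                    heapq.heappush(best, (-d, s))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, s))
        # Report neighbors from nearest to farthest.
        result[q] = [s for _, s in sorted(best, key=lambda t: -t[0])]
    return result
```

Sorting every non-empty cell for each query point keeps the sketch short but would be too slow at scale; a distributed version would instead decide per cell pair which data must be shipped to each reducer, which is where pruning saves transmission cost.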
