Pivot-based approximate k-NN similarity joins for big high-dimensional data

Abstract. Given an appropriate similarity model, the k-nearest neighbor (k-NN) similarity join is a useful yet costly operator for data mining, data analysis, and data exploration applications. The time needed to evaluate the operator depends on the size of the datasets, the data distribution, and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach, which relies on reference dataset objects (pivots), the paper investigates distributed similarity join techniques with and without approximation guarantees and proposes high-dimensional extensions to previously proposed algorithms. It describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches, and presents an empirical evaluation of the performance, approximation precision, and scalability of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. The key findings of the experimental analysis are that randomly initialized pivot-based methods perform well on big high-dimensional data and that, in general, the choice of the best algorithm depends on the desired levels of approximation guarantee, precision, and execution time.
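To make the pivot-based idea more concrete, the following is a minimal single-machine sketch in Python. It is not one of the algorithms evaluated in the paper; it only illustrates the general pattern under simplifying assumptions: pivots are sampled at random from the database, every object is routed to the cell of its nearest pivot (the step a map phase would perform), and each query is joined only against the database objects in its own cell (the step a reduce phase would perform). The function names `euclidean` and `approx_knn_join`, the parameters, and the choice of Euclidean distance are hypothetical and chosen for illustration.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def approx_knn_join(queries, database, k, num_pivots, seed=0, dist=euclidean):
    """Approximate k-NN join: for each query, return ids of up to k
    database objects found in the query's own pivot cell.

    Illustrative assumption: neighbors in other cells are ignored,
    so the result is approximate and carries no precision guarantee.
    """
    rng = random.Random(seed)
    # Randomly initialized pivots, sampled from the database itself.
    pivots = rng.sample(database, num_pivots)

    def nearest_pivot(obj):
        # Index of the pivot closest to obj (ties go to the lowest index).
        return min(range(num_pivots), key=lambda i: dist(obj, pivots[i]))

    # "Map" step: route every query and database object to a pivot cell.
    cells = defaultdict(lambda: ([], []))
    for q_id, q in enumerate(queries):
        cells[nearest_pivot(q)][0].append((q_id, q))
    for o_id, o in enumerate(database):
        cells[nearest_pivot(o)][1].append((o_id, o))

    # "Reduce" step: brute-force k-NN inside each cell independently.
    result = {}
    for cell_queries, cell_objects in cells.values():
        for q_id, q in cell_queries:
            ranked = sorted(cell_objects, key=lambda item: dist(q, item[1]))
            result[q_id] = [o_id for o_id, _ in ranked[:k]]
    return result


if __name__ == "__main__":
    database = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    queries = [(0.1, 0.2), (10.1, 10.2)]
    print(approx_knn_join(queries, database, k=2, num_pivots=2))
```

With a single pivot the sketch degenerates to an exact brute-force join; more pivots yield smaller cells and less work per cell, at the cost of possibly missing true neighbors that fall into adjacent cells, and a query may receive fewer than k neighbors if its cell is small.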
