An Efficient Batch Similarity Processing with MapReduce

In this paper, we study an efficient approach to batch similarity processing with MapReduce. Using the inverted index as a backbone, we embed metadata inside the indexes to minimize redundant data and thus build lightweight indexes from the data sources. In addition, we propose a general batch query processing scheme that handles not only a single query but also sets of queries in an incremental manner. Moreover, we build the indexes in an ordered fashion so that unnecessary objects can be pruned quickly, which improves the performance of similarity search. Finally, we evaluate the proposed solution through empirical experiments on real datasets. The results verify the efficiency of our method for similarity search with query batches, especially when both the query sets and the datasets are large.
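To make the idea concrete, the following is a minimal single-machine sketch of an inverted index that carries metadata in its postings and answers a batch of queries with early pruning. It is not the paper's algorithm: the choice of Jaccard similarity, the set-size metadata, the size-ratio filter, and the names `build_index` and `batch_search` are all assumptions made for illustration, and no MapReduce runtime is involved.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> sorted list of (doc_id, set_size).

    The set size stored in each posting is the assumed "embedded metadata";
    it lets a query skip a document without fetching the document itself.
    """
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        unique = set(terms)
        for term in unique:
            index[term].append((doc_id, len(unique)))
    for postings in index.values():
        postings.sort()  # ordered postings, a stand-in for an ordered index
    return dict(index)


def batch_search(index, queries, threshold):
    """Return {query_id: [(doc_id, jaccard), ...]} for similarities >= threshold."""
    results = {}
    for query_id, terms in queries.items():
        q = set(terms)
        overlap = defaultdict(int)
        sizes = {}
        for term in q:
            for doc_id, size in index.get(term, []):
                # Pruning: Jaccard(A, B) can never exceed min(|A|,|B|) / max(|A|,|B|),
                # so a document whose size ratio is below the threshold is skipped.
                if min(size, len(q)) / max(size, len(q)) < threshold:
                    continue
                overlap[doc_id] += 1
                sizes[doc_id] = size
        hits = []
        for doc_id, shared in overlap.items():
            similarity = shared / (len(q) + sizes[doc_id] - shared)
            if similarity >= threshold:
                hits.append((doc_id, similarity))
        # Highest similarity first; ties broken by document id.
        results[query_id] = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    return results
```

Calling `batch_search(build_index(docs), queries, 0.5)` with `docs` and `queries` given as dictionaries of term lists returns, for each query, the documents whose Jaccard similarity is at least 0.5. In a MapReduce setting, the index construction would roughly correspond to a map phase that emits postings and a reduce phase that groups them by term, but that mapping is likewise an assumption here.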
