A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce

Similarity search is a principal operation in many fields of study. However, it is expensive, mainly because of redundancy and heavy data loads. Many approaches aim to speed up similarity search, especially on massive datasets, so that it can be used in a wide range of recent applications. In this paper, we study an efficient way to perform either single or batch similarity processing with MapReduce while minimizing redundant data by building lightweight indexes from the data and query sources. More specifically, we propose a general query processing scheme that handles not only a single query but also sets of queries in an incremental manner. In addition, we build the indexes in an ordered fashion, as so-called sorted inverted indexes, so that a quick pruning strategy can discard unrelated objects. Moreover, we embed metadata inside the indexes to reduce unnecessary duplicates. Finally, we evaluate the proposed solution through empirical experiments on real datasets. The results verify the efficiency of our method for similarity search with query batches, especially when both the query sets and the datasets are large.
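
The idea of indexing both the data and the query batch in sorted order can be illustrated with a minimal single-machine sketch. It is not the paper's MapReduce implementation: the function names, the use of Jaccard similarity, and the fixed threshold are assumptions made only for illustration. Both sources are turned into inverted indexes whose terms are kept in sorted order, and the two indexes are merged so that only shared terms are visited. Objects that share no term with any query are never scored, which is the pruning effect described above.

```python
from collections import defaultdict


def build_sorted_index(records):
    """Return [(term, sorted record ids)] with the terms in sorted order."""
    postings = defaultdict(set)
    for rec_id, terms in records.items():
        for term in terms:
            postings[term].add(rec_id)
    return [(term, sorted(postings[term])) for term in sorted(postings)]


def shared_term_counts(data_index, query_index):
    """Merge two sorted indexes and count shared terms per (query, object) pair."""
    counts = defaultdict(int)
    i = j = 0
    while i < len(data_index) and j < len(query_index):
        d_term, d_ids = data_index[i]
        q_term, q_ids = query_index[j]
        if d_term < q_term:
            i += 1  # term occurs only in the data: skip it
        elif d_term > q_term:
            j += 1  # term occurs only in the queries: skip it
        else:
            for q in q_ids:
                for d in d_ids:
                    counts[(q, d)] += 1
            i += 1
            j += 1
    return counts


def batch_jaccard(data, queries, threshold):
    """Answer a whole query batch; Jaccard similarity is an assumed measure."""
    data_index = build_sorted_index(data)
    query_index = build_sorted_index(queries)
    counts = shared_term_counts(data_index, query_index)
    results = defaultdict(list)
    for (q, d), common in sorted(counts.items()):
        union = len(set(queries[q])) + len(set(data[d])) - common
        score = common / union
        if score >= threshold:
            results[q].append((d, score))
    return dict(results)


if __name__ == "__main__":
    # Hypothetical toy data; "d2" shares no term with any query and is never scored.
    data = {
        "d1": ["map", "reduce", "index"],
        "d2": ["hash", "join"],
        "d3": ["index", "search"],
    }
    queries = {"q1": ["index", "search"], "q2": ["map", "reduce"]}
    print(batch_jaccard(data, queries, threshold=0.5))
```

In a MapReduce setting, the merge over shared terms would roughly correspond to grouping postings from both sources by term in the shuffle phase, but that mapping is a reading of the abstract rather than a detail it confirms.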
