Paper Information - An Adaptive Similarity Search in Massive Datasets

An Adaptive Similarity Search in Massive Datasets

Similarity search is an important task in many fields of study and application domains. The era of big data, however, poses challenges for existing information systems in general and for similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search scheme for massive datasets with MapReduce. The proposed scheme is applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. Moreover, we embed collaborative refinements to reduce irrelevant data objects and avoid unnecessary computations. Furthermore, we evaluate the proposed methods with two different document models, known as shingles and terms. Finally, we conduct intensive empirical experiments on real datasets, both to verify the methods themselves and to compare them with a previous related work. The results confirm the effectiveness of the proposed methods and show that they outperform the previous work in terms of query processing.
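
To make the query types and document models named in the abstract concrete, here is a minimal single-machine sketch. It is not the paper's MapReduce scheme: the abstract does not state which similarity measure or refinements are used, so Jaccard similarity over sets is an assumption, and the helper names (`word_shingles`, `terms`, `jaccard`, `range_query`, `knn_query`) are hypothetical.

```python
from heapq import nlargest


def word_shingles(text, k=2):
    """Shingle model (assumed form): the set of k consecutive-word tuples."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def terms(text):
    """Term model (assumed form): the set of distinct lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def range_query(query, docs, threshold, model=terms):
    """Return ids of documents whose similarity to the query is >= threshold."""
    q = model(query)
    return [doc_id for doc_id, text in docs.items()
            if jaccard(q, model(text)) >= threshold]


def knn_query(query, docs, k, model=terms):
    """Return the ids of the k documents most similar to the query."""
    q = model(query)
    return nlargest(k, docs, key=lambda doc_id: jaccard(q, model(docs[doc_id])))
```

Passing `model=word_shingles` instead of the default `terms` switches the document model without changing the query logic. A distributed version would partition the scoring step across workers, which this sketch does not attempt.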

Tran Khanh Dang | Josef Küng | Trong Nhan Phan

[1] Tran Khanh Dang, et al. The SH-tree: A Super Hybrid Index Structure for Multidimensional Data, 2001, DEXA.

[2] Radoslaw Szmit. Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data, 2013, IIS.

[3] Tran Khanh Dang, et al. Solving approximate similarity queries, 2007, Comput. Syst. Sci. Eng.

[4] Chen Li, et al. Efficient parallel set-similarity joins using MapReduce, 2010, SIGMOD Conference.

[5] Tao Yang, et al. Optimizing parallel algorithms for all pairs similarity search, 2013, WSDM.

[6] Jure Leskovec. Finding Similar Items, 2012.

[7] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[8] Jimmy J. Lin, et al. No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity, 2011, SIGIR '11.

[9] Tran Khanh Dang, et al. An Efficient Similarity Search in Large Data Collections with MapReduce, 2014, FDSE.

[10] Andreas Paepcke, et al. SpotSigs: robust and efficient near duplicate detection in large web collections, 2008, SIGIR '08.

[11] Felix Naumann, et al. Efficient Similarity Search in Very Large String Sets, 2012, SSDBM.