An Adaptive Similarity Search in Massive Datasets

Similarity search is an important task in many fields of study and in various application domains. The era of big data, however, poses challenges for existing information systems in general and for similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search scheme for massive datasets with MapReduce. The proposed scheme is applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. Moreover, we embed collaborative refinements that reduce the number of irrelevant data objects and unnecessary computations. We also evaluate the proposed methods with two different document models, shingles and terms. Finally, we conduct extensive empirical experiments on real datasets, both to verify the methods themselves and to compare them with previous related work. The results confirm the effectiveness of the proposed methods and show that they outperform the previous work in terms of query processing.
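To make the two document models and the query types named above concrete, the following is a minimal single-machine sketch, not the MapReduce scheme proposed in the paper. It assumes word-level shingles of length 3 and Jaccard similarity as the measure; the abstract specifies neither, and all function names (`term_set`, `shingle_set`, `jaccard`, `range_query`, `knn_query`) are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


def term_set(text: str) -> Set[str]:
    """Term model: the set of lowercase whitespace-separated words."""
    return set(text.lower().split())


def shingle_set(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Shingle model: the set of windows of k consecutive words.

    Word-level shingles with k=3 are an assumption of this sketch.
    """
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure; 0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def range_query(query: Set, docs: Dict[str, Set], threshold: float) -> List[Tuple[str, float]]:
    """Range query: every document whose similarity to the query is >= threshold."""
    scored = ((doc_id, jaccard(query, feats)) for doc_id, feats in docs.items())
    return sorted((p for p in scored if p[1] >= threshold), key=lambda p: (-p[1], p[0]))


def knn_query(query: Set, docs: Dict[str, Set], k: int) -> List[Tuple[str, float]]:
    """k-Nearest Neighbour query: the k most similar documents, ties broken by id."""
    scored = [(doc_id, jaccard(query, feats)) for doc_id, feats in docs.items()]
    return sorted(scored, key=lambda p: (-p[1], p[0]))[:k]


if __name__ == "__main__":
    corpus = {
        "d1": "the quick brown fox",
        "d2": "the quick brown dog",
        "d3": "lorem ipsum dolor sit",
    }
    by_terms = {doc_id: term_set(text) for doc_id, text in corpus.items()}
    query = term_set("the quick brown fox")
    print(range_query(query, by_terms, threshold=0.5))
    print(knn_query(query, by_terms, k=2))
```

The same `range_query` and `knn_query` functions work with either model because both reduce a document to a set of features; only the feature extraction differs. Shingles preserve word order and therefore tend to give lower scores than terms for documents that share vocabulary but differ in phrasing.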