Paper information - Light-weight top-k string similarity search with a two-level inverted index

Given a query string and a collection of strings, top-k string similarity search finds the k strings in the collection that are most similar to the query string under edit distance. Most existing work has focused on a filter-and-verify framework that prunes non-candidates using lower bounds on edit distance. The best current implementations require more than 10 seconds to answer a top-40 query on a real eBay dataset. In this paper, we propose a novel light-weight algorithm that answers the top-40 eBay queries in around 350 milliseconds. Unlike existing work, the answer is approximate, but we show that more than 95% of the final results are returned.
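To make the problem statement concrete, the following is a minimal exact baseline in Python, not the two-level inverted index algorithm proposed in the paper. It illustrates the filter-and-verify idea with one well-known lower bound: the edit distance between two strings is at least the difference of their lengths. The function names `edit_distance` and `top_k_similar` are hypothetical and chosen only for this sketch.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def top_k_similar(query: str, collection: list[str], k: int) -> list[tuple[int, str]]:
    """Return up to k (distance, string) pairs closest to query, exactly.

    Filter step: skip a string when the length-difference lower bound
    already shows it cannot beat the current k-th best distance.
    Verify step: compute the true edit distance for the survivors.
    """
    heap: list[tuple[int, str]] = []  # max-heap on distance via negation
    for s in collection:
        lower_bound = abs(len(s) - len(query))
        if len(heap) == k and lower_bound >= -heap[0][0]:
            continue  # pruned without computing the edit distance
        d = edit_distance(query, s)
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, s))
    return sorted((-neg_d, s) for neg_d, s in heap)
```

For example, `top_k_similar("kitten", ["sitting", "mitten", "kitchen", "bitten", "smitten"], 2)` returns `[(1, 'bitten'), (1, 'mitten')]`. This baseline still scans the whole collection, which is the kind of cost that index-based and approximate methods aim to avoid.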

Xiaoli Wang | Yuhui Zheng | Tianzhi Deng
