Light-weight top-k string similarity search with a two-level inverted index

Given a query string and a collection of strings, top-k string similarity search finds the k strings in the collection that are most similar to the query, as measured by edit distance. Most existing work follows a filter-and-verify framework, which uses lower bounds on edit distance to prune non-candidates. The best current implementations need more than 10 seconds to answer a top-40 query on a real eBay dataset. In this paper, we propose a novel light-weight algorithm that answers the top-40 eBay queries in around 350 milliseconds. Unlike existing work, our answers are approximate, but we show that more than 95% of the true top-k results are returned.
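
For readers unfamiliar with the filter-and-verify framework mentioned above, the following is a minimal sketch of an exact baseline, not the light-weight two-level inverted index algorithm proposed in this paper. It assumes the simplest lower bound on edit distance, the difference in string lengths, as its filter. The function names `edit_distance` and `top_k` are hypothetical and chosen only for illustration.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def top_k(query: str, strings: list[str], k: int) -> list[tuple[int, str]]:
    """Exact top-k by edit distance with a length-difference filter.

    Assumption: |len(s) - len(query)| is used as the lower bound; real
    systems typically use tighter, index-backed bounds.
    """
    if k <= 0:
        return []
    # Visit candidates in order of increasing lower bound so the scan
    # can stop as soon as the bound reaches the current k-th best distance.
    order = sorted(range(len(strings)),
                   key=lambda i: abs(len(strings[i]) - len(query)))
    heap = []  # max-heap on distance, stored as (-distance, -index, string)
    for i in order:
        s = strings[i]
        lower = abs(len(s) - len(query))
        if len(heap) == k and lower >= -heap[0][0]:
            break  # filter: no remaining string can improve the result
        d = edit_distance(query, s)  # verify
        if len(heap) < k:
            heapq.heappush(heap, (-d, -i, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, -i, s))
    return sorted((-nd, s) for nd, _, s in heap)


if __name__ == "__main__":
    titles = ["kitten", "sitting", "mitten", "kitchen", "bitten", "a"]
    for dist, s in top_k("kitten", titles, 3):
        print(dist, s)
```

Because every candidate that survives the filter is verified with a full edit-distance computation, this baseline is exact but slow on large collections, which is the cost that an approximate approach trades away.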
