Optimized Signature Selection for Efficient String Similarity Search

In this paper, we study the problem of string similarity search: retrieving from a database all strings whose edit distance to a query string is within a given threshold. Many algorithms have been proposed under a filtering-and-verification framework to solve this problem. Because edit distance verification is expensive, it is crucial to generate a small number of candidates efficiently in the filtering phase. Recently, an index structure named HSTree was proposed for efficiently generating candidate strings. To generate candidates, the existing technique selects and uses HSTree nodes at a specific level calculated from the given threshold. We observe that there are many alternative ways to select HSTree nodes and, based on this observation, propose a novel technique that selects HSTree nodes in an optimized way. We also propose a modified HSTree, named a threaded HSTree, which connects the inverted lists of an HSTree node to the inverted lists of its child nodes. With a threaded HSTree, we can reduce the overhead of index lookups in HSTree nodes while selecting optimal tree nodes. Experimental results show that the proposed technique significantly outperforms the existing HSTree-based technique.
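To make the filtering-and-verification framework concrete, the following is a minimal sketch in Python. It is not the HSTree or the technique proposed in this paper: the filter is a simple length filter, chosen only for illustration, and the names `edit_distance_within` and `similarity_search` are hypothetical. It shows why a small candidate set matters: every candidate that survives the filter pays for a dynamic-programming edit distance check.

```python
from typing import List


def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Verification step: return True if the edit distance of a and b is <= tau."""
    # Strings whose lengths differ by more than tau cannot be within tau edits.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cur[j] = min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            )
        # Row minima never decrease, so the final distance is at least min(cur).
        if min(cur) > tau:
            return False
        prev = cur
    return prev[len(b)] <= tau


def similarity_search(database: List[str], query: str, tau: int) -> List[str]:
    """Filter-and-verify search using a length filter (illustrative assumption)."""
    # Filtering phase: cheaply discard strings that cannot possibly qualify.
    candidates = [s for s in database if abs(len(s) - len(query)) <= tau]
    # Verification phase: run the expensive check on the surviving candidates only.
    return [s for s in candidates if edit_distance_within(s, query, tau)]


if __name__ == "__main__":
    db = ["kitten", "sitting", "mitten", "kit"]
    print(similarity_search(db, "kitten", 1))
```

An index-based filter such as the HSTree plays the role of the list comprehension in `similarity_search`, aiming to produce far fewer candidates than a length filter alone.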
