Efficient Similarity Search in Very Large String Sets

String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy that makes the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
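
To make the problem statement concrete, the following is a minimal sketch of edit-distance search over a plain trie. It is a well-known baseline technique (one row of the dynamic-programming matrix per trie node, with pruning), not the SSI state labeling strategy itself, whose details are not given in this abstract. The names `TrieNode`, `build_trie`, `similarity_search`, and the threshold parameter `k` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set when a stored string ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a prefix tree and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def similarity_search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, edit distance) for all stored strings within distance k."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        _descend(child, ch, query, first_row, k, results)
    return sorted(results)


def _descend(node: TrieNode, ch: str, query: str, prev_row: List[int],
             k: int, results: List[Tuple[str, int]]) -> None:
    """Extend the DP matrix by one row for the prefix ending in `ch`."""
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1,          # insertion
                       prev_row[j] + 1,         # deletion
                       prev_row[j - 1] + cost)) # match or substitution
    if node.word is not None and row[-1] <= k:
        results.append((node.word, row[-1]))
    # Prune: if every cell exceeds k, no extension of this prefix can match.
    if min(row) <= k:
        for next_ch, child in node.children.items():
            _descend(child, next_ch, query, row, k, results)


if __name__ == "__main__":
    names = ["meier", "meyer", "maier", "mayer", "miller", "smith"]
    print(similarity_search(build_trie(names), "meier", 1))
```

Shared prefixes are processed once, and the pruning step skips whole subtrees, which is what makes a prefix index attractive for this problem. The sketch stores one object per character, so it does not reflect the space savings the abstract attributes to SSI.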
