The Token Distribution Filter for Approximate String Membership

A common application over web data is to find all the strings in a collection of pages that match strings in a given dictionary. We consider the problem of extracting all the strings or substrings in a document (or a page) that approximately match some string in a given dictionary. The current state-of-the-art approach to this problem first applies a fast, approximate filter and then applies a more expensive exact verification algorithm to the strings that survive the filter. Many string filters, such as the length filter and the prefix filter, have been proposed. However, we find that many of these filters are ineffective or inefficient in some problem scenarios. In this paper, we propose a new filter, the token distribution filter (TDF). We conduct experiments on both synthetic and real data sets and show that, for a wide class of problems, TDF performs better than previously proposed filters.
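
As an illustration of the filter-then-verify approach described above, the following is a minimal Python sketch that uses the length filter as the cheap first stage and edit distance as the exact verification. It does not implement TDF, which the abstract does not specify. The function names, the whitespace tokenization, the `max_tokens` bound on candidate substrings, and the choice of edit distance as the similarity measure are all assumptions made for this example.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def length_filter(candidate: str, entry: str, k: int) -> bool:
    """Cheap filter: strings whose lengths differ by more than k
    cannot be within edit distance k, so they are pruned."""
    return abs(len(candidate) - len(entry)) <= k


def approximate_matches(document: str, dictionary: List[str], k: int,
                        max_tokens: int = 3) -> List[Tuple[str, str]]:
    """Return (substring, dictionary entry) pairs within edit distance k.

    Candidates are runs of 1..max_tokens whitespace-separated tokens
    (an assumption of this sketch, not something the abstract states).
    """
    tokens = document.split()
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_tokens, len(tokens)) + 1):
            candidate = " ".join(tokens[start:end])
            for entry in dictionary:
                if not length_filter(candidate, entry, k):
                    continue  # pruned without running the expensive check
                if edit_distance(candidate, entry) <= k:
                    matches.append((candidate, entry))
    return matches
```

For example, `approximate_matches("the token distribtion filter works", ["token distribution filter"], k=1)` verifies only the candidates that pass the length check. The filter never discards a true match; it only reduces how often the quadratic-time verification runs.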
