Paper Information - The Token Distribution Filter for Approximate String Membership

The Token Distribution Filter for Approximate String Membership

A common application over web data is to find all the strings in a collection of pages that match strings in a given dictionary. We consider the problem of extracting all the strings or substrings in a document (or a page) that approximately match some string in a given dictionary. The current state-of-the-art approach to this problem first applies a fast, approximate filter, then applies a more expensive exact verification algorithm to the strings that survive the filter. Many string filters, such as the length filter and the prefix filter, have been proposed. However, we find that many string filters are ineffective or inefficient in some problem scenarios. In this paper, we propose a new filter, the TDF (token distribution filter). We conduct experiments on both synthetic and real data sets, and show that for a wide class of problems it performs better than previously proposed filters.
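The filter-then-verify pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the TDF, which the abstract does not specify. It uses the simpler length filter as the cheap first stage and edit distance as the exact verification step. The function names, the whitespace tokenization, and the `max_tokens` window bound are all assumptions made for illustration.

```python
from typing import Iterator, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling DP rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def length_filter(candidate: str, entry: str, k: int) -> bool:
    """Cheap filter: strings within edit distance k differ in length by at most k."""
    return abs(len(candidate) - len(entry)) <= k


def candidate_substrings(document: str, max_tokens: int) -> Iterator[str]:
    """Yield every run of 1..max_tokens whitespace-separated tokens (an assumption)."""
    tokens = document.split()
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_tokens, len(tokens)) + 1):
            yield " ".join(tokens[start:end])


def approximate_matches(document: str, dictionary: List[str], k: int,
                        max_tokens: int = 3) -> Set[Tuple[str, str]]:
    """Return (substring, dictionary entry) pairs within edit distance k."""
    matches = set()
    for cand in candidate_substrings(document, max_tokens):
        for entry in dictionary:
            if not length_filter(cand, entry, k):
                continue  # pruned without running the expensive check
            if edit_distance(cand, entry) <= k:  # exact verification
                matches.add((cand, entry))
    return matches


if __name__ == "__main__":
    doc = "we flew to new yrok city yesterday"
    print(approximate_matches(doc, ["new york", "seattle"], k=2))
```

The filter only needs to be safe, meaning it never discards a true match. How many false candidates it lets through determines how much verification work remains, and that is the quantity the abstract says existing filters handle poorly in some scenarios.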

Jeffrey F. Naughton | Chong Sun

[1] Luis Gravano, et al. Approximate String Joins in a Database (Almost) for Free, 2001, VLDB.

[2] Surajit Chaudhuri, et al. A Primitive Operator for Similarity Joins in Data Cleaning, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[3] Jiaheng Lu, et al. Efficient Merging and Filtering Algorithms for Approximate String Searches, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[4] Gonzalo Navarro, et al. A guided tour to approximate string matching, 2001, CSUR.

[5] Sunita Sarawagi, et al. Efficient Batch Top-k Search for Dictionary-based Entity Recognition, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[6] Piotr Indyk, et al. Similarity Search in High Dimensions via Hashing, 1999, VLDB.

[7] Xuemin Lin, et al. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints, 2008, Proc. VLDB Endow.

[8] Divesh Srivastava, et al. Fast Indexes and Algorithms for Set Similarity Selection Queries, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[9] Alfred V. Aho, et al. Efficient string matching, 1975, Commun. ACM.

[10] Ömer Egecioglu, et al. Dictionary Look-Up within Small Edit Distance, 2002, COCOON.

[11] Sunita Sarawagi, et al. Efficient set joins on similarity predicates, 2004.

[12] Surajit Chaudhuri, et al. An efficient filter for approximate membership checking, 2008, SIGMOD Conference.

[13] Raghav Kaushik, et al. Efficient exact set-similarity joins, 2006, VLDB.

[14] Divesh Srivastava, et al. Flexible String Matching Against Large Databases in Practice, 2004, VLDB.

[15] Burton H. Bloom, et al. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.