Fast Indexes and Algorithms for Set Similarity Selection Queries

Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Set similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on set similarity selection queries: given a query set, retrieve all sets in a collection whose similarity to the query exceeds a given threshold. Various set similarity measures have been proposed for data cleaning. We focus on weighted similarity functions such as TF/IDF and introduce variants that are well suited to set similarity selection in a relational database context. These variants have special semantic properties that can be exploited to design very efficient index structures and query-answering algorithms. We present modifications of existing techniques that adapt them to set similarity selection queries. We also introduce three novel algorithms, based on the Threshold Algorithm, that exploit the semantic properties of the new similarity measures to achieve the best performance in theory and in practice.
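
To make the query definition concrete, the following is a minimal brute-force sketch of a set similarity selection, not one of the algorithms or indexes introduced in this work. It assumes an IDF-weighted, length-normalized overlap as the similarity measure; the names `idf_weights`, `similarity`, and `select`, the weight formula log(1 + N/df), and the sample data are all illustrative assumptions.

```python
import math
from collections import Counter


def idf_weights(collection):
    """Assumed weighting: idf(t) = log(1 + N / df(t)) over a list of sets."""
    n = len(collection)
    # Each set contributes at most once per token, so this counts document frequency.
    df = Counter(tok for s in collection for tok in s)
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def norm(s, weights):
    """Weighted length of a set; tokens unseen in the collection get weight 0."""
    return math.sqrt(sum(weights.get(tok, 0.0) ** 2 for tok in s))


def similarity(query, s, weights):
    """Sum of squared weights of shared tokens, normalized by both lengths."""
    denom = norm(query, weights) * norm(s, weights)
    if denom == 0.0:
        return 0.0
    shared = sum(weights.get(tok, 0.0) ** 2 for tok in query & s)
    return shared / denom


def select(query, collection, threshold):
    """Return the indexes of all sets whose similarity to the query exceeds threshold."""
    weights = idf_weights(collection)
    return [i for i, s in enumerate(collection)
            if similarity(query, s, weights) > threshold]


# Hypothetical collection of tokenized address strings.
collection = [{"main", "st"}, {"main", "street"}, {"maine", "st"}]
print(select({"main", "st"}, collection, 0.5))
```

This scan computes the similarity of the query against every set, which is the baseline cost that index structures and Threshold Algorithm style methods aim to avoid.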
