Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

Many database applications increasingly need to support fuzzy queries, which ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964". Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating the selectivity of fuzzy string predicates, and we develop a novel technique, called SEPIA, to solve it. SEPIA groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the same cluster, we can use the proximity between q and p and the proximity between p and s to obtain, from the global histogram, a probability distribution of the similarity between q and s. We give a full specification of the technique using the edit distance function. We study the challenges in adopting this technique, including how to construct the histogram structures, how to use them for selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We also discuss how to extend the technique to other similarity functions. Our extensive experiments on real data sets show that the technique can accurately estimate the selectivity of fuzzy string predicates.
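To make the estimated quantity concrete, the following minimal sketch computes the exact selectivity of a fuzzy predicate "edit distance to q is at most k" by brute force. It is not SEPIA itself, only the ground truth that an estimator of this kind would approximate without scanning every record. The function names, the sample names list, and the threshold k are assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def exact_selectivity(records, query, k):
    """Fraction of records whose edit distance to `query` is at most k."""
    if not records:
        return 0.0
    matches = sum(1 for s in records if edit_distance(query, s) <= k)
    return matches / len(records)


# Hypothetical sample data, not taken from the paper's experiments.
names = ["smith", "smyth", "smithe", "schmidt", "jones"]
print(exact_selectivity(names, "smith", 1))  # 3 of the 5 names are within distance 1
```

Because edit distance is a metric, the distance between q and s is bounded by the distances from q to p and from p to s through the triangle inequality, which is why the proximity of q and s to a shared string p carries information about the proximity between q and s.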
