SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases

Many database applications increasingly need to support approximate queries that ask for strings similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization requires the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates and develop a novel technique, called Sepia, to solve it. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. The technique rests on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the same cluster, the proximity between q and p together with the proximity between p and s lets us obtain, from the global histogram, a probability distribution over the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study the challenges in adopting this technique, including how to construct the histogram structures, how to use them for selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We also discuss how to extend the technique to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.
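To make the estimated quantity concrete, the following minimal sketch computes the exact selectivity of an edit-distance predicate by brute force. It is not the Sepia technique: it scans every string, which is the cost a histogram-based estimator is meant to avoid. The function names `edit_distance` and `exact_selectivity` and the threshold parameter `k` are illustrative assumptions, not names taken from the paper.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(strings: Sequence[str], query: str, k: int) -> float:
    """Fraction of strings within edit distance k of the query (hypothetical helper)."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= k)
    return matches / len(strings)
```

For example, `exact_selectivity(["smith", "smyth", "smithe", "jones"], "smith", 1)` counts three of the four strings as matches. An estimator such as the one described above aims to approximate this fraction from precomputed summaries instead of a full scan.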
