One-dimensional and multi-dimensional substring selectivity estimation

Abstract. With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches must be made on multiple attributes (dimensions), and those dimensions are correlated. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, analytically and experimentally, that MO is practical and substantially superior to competing algorithms.
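
To make the 1-D setting concrete, the following is a minimal sketch of one plausible reading of the maximal-overlap idea, not the paper's own algorithm. It assumes that a plain dictionary of substring counts can stand in for a PST, that counts are presence counts (the number of strings containing a substring), and that each maximal piece of the query is conditioned on its overlap with the previous piece. The names `build_pruned_counts`, `mo_estimate`, and the `default` fallback are hypothetical and introduced only for illustration.

```python
from collections import Counter


def build_pruned_counts(strings, threshold):
    """Dictionary stand-in for a pruned count-suffix tree (an assumption).

    For every substring, count how many input strings contain it, then
    keep only the substrings whose count is at least `threshold`.
    """
    counts = Counter()
    for s in strings:
        # A set, so each substring is counted at most once per string.
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        counts.update(subs)
    return {sub: c for sub, c in counts.items() if c >= threshold}


def mo_estimate(query, counts, n, default=0.0):
    """Estimate the fraction of the n strings that contain `query`.

    The query is covered by its maximal substrings that survive pruning.
    Each piece after the first is conditioned on its overlap with the
    previous piece: Pr(piece) / Pr(overlap).
    `default` is a hypothetical fallback for characters that were pruned.
    """
    pieces = []
    last_end = 0
    for i in range(len(query)):
        # Longest retained substring starting at position i.
        j = len(query)
        while j > i and query[i:j] not in counts:
            j -= 1
        if j == i:
            return default  # even the single character was pruned
        if j > last_end:  # skip pieces contained in an earlier piece
            pieces.append((i, j))
            last_end = j

    estimate = 1.0
    prev_end = 0
    for k, (i, j) in enumerate(pieces):
        estimate *= counts[query[i:j]] / n
        if k > 0:
            overlap = query[i:prev_end]  # empty when pieces only touch
            if overlap:
                estimate /= counts[overlap] / n
        prev_end = j
    return estimate


if __name__ == "__main__":
    data = ["abc", "abd", "bcd", "abcd"]
    table = build_pruned_counts(data, threshold=2)
    print(mo_estimate("abcd", table, n=len(data)))
```

With this toy data the query `abcd` is pruned from the table, so the estimate is assembled from the retained pieces `abc` and `bcd` and their overlap `bc`. An estimator that assumes independence between pieces would simply multiply the piece probabilities without dividing by the overlap term.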
