One-dimensional and multi-dimensional substring selectivity estimation

Abstract. With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
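The abstract names pruned count-suffix trees and the MO (Maximal Overlap) estimator but gives no formulas, so the following is only a minimal sketch of the 1-D idea under stated assumptions. A plain dictionary of substring presence counts stands in for the tree, and pruning keeps substrings whose count reaches a threshold. The estimator parses the query into maximal substrings that survive pruning and multiplies their probabilities, each conditioned on its overlap with the previous piece. The function names, the `min_count` pruning rule, and the `default_count` fallback for pruned single characters are hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import Counter


def build_pruned_counts(strings, min_count):
    """Dictionary stand-in for a pruned count-suffix tree (an assumption).

    Maps each substring to the number of input strings that contain it,
    keeping only substrings with a count of at least min_count.
    """
    counts = Counter()
    for s in strings:
        # A set ensures each distinct substring is counted once per string.
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        counts.update(subs)
    return {sub: c for sub, c in counts.items() if c >= min_count}


def mo_estimate(query, pruned, n, default_count=1):
    """Maximal-overlap style selectivity estimate for a substring query.

    n is the number of strings the counts were built from. default_count is
    a hypothetical fallback count for characters that were pruned away.
    """
    prob = 1.0
    covered_end = 0  # end index of the query prefix already accounted for
    for start in range(len(query)):
        # Longest substring starting at `start` that survived pruning.
        end = start
        while end < len(query) and query[start:end + 1] in pruned:
            end += 1
        if end == start:
            # Even the single character was pruned: use the fallback count.
            if start >= covered_end:
                prob *= default_count / n
                covered_end = start + 1
            continue
        if end <= covered_end:
            continue  # contained in an earlier piece, so not maximal
        # Condition on the overlap with the previous piece (empty if none).
        overlap = query[start:covered_end]
        denom = pruned[overlap] if overlap else n
        prob *= pruned[query[start:end]] / denom
        covered_end = end
    return prob


data = ["abcd", "abc", "bcd", "bc", "ab"]
pruned = build_pruned_counts(data, min_count=2)
estimate = mo_estimate("abcd", pruned, n=len(data))
print(estimate * len(data))  # estimated number of strings containing "abcd"
```

In this toy data set "abcd" itself is pruned, so the query is parsed into "abc" and "bcd", and the estimate is Pr(abc) · Pr(bcd) / Pr(bc), where "bc" is the overlap of the two pieces. An estimator that treats the pieces as independent would ignore that overlap.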

Divesh Srivastava | H. V. Jagadish | Raymond T. Ng | Olga Kapitskaia
