Efficient and scalable statistics gathering for large databases in Oracle 11g

Large tables are often decomposed into smaller pieces called partitions in order to improve query performance and ease the data management. Query optimizers rely on both the statistics of the entire table and the statistics of the individual partitions to select a good execution plan for a SQL statement. In Oracle 10g, we scan the entire table twice, one pass for gathering the table level statistics and the other pass for gathering the partition level statistics. A consequence of this gathering method is that, when the data in some partitions change, not only do we need to scan the changed partitions to gather the partition level statistics, but also we have to scan the entire table again to gather the table level statistics. Oracle 11g adopts a one-pass distinct sampling based method which can accurately derive the table level statistics from the partition level statistics. When data change, Oracle only re-gathers the statistics for the changed partitions and then derives the table level statistics without touching the unchanged partitions. To the best of our knowledge, although the one-pass distinct sampling has been researched in academia for some years, Oracle is the first commercial database that implements the technique. We have performed extensive experiments on both benchmark data and real customer data. Our experiments illustrate the this new method is highly accurate and has significantly better performance than the old method used in Oracle 10g.

[1]  Meikel Pöss,et al.  New TPC benchmarks for decision support and web commerce , 2000, SGMD.

[2]  Rajeev Motwani,et al.  Towards estimation error guarantees for distinct values , 2000, PODS.

[3]  Peter J. Haas,et al.  Maintaining bernoulli samples over evolving multisets , 2007, PODS '07.

[4]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[5]  Jeffrey F. Naughton,et al.  Sampling-Based Estimation of the Number of Distinct Values of an Attribute , 1995, VLDB.

[6]  Christian S. Jensen,et al.  Temporal Data Management , 1999, IEEE Trans. Knowl. Data Eng..

[7]  Kyu-Young Whang,et al.  A linear-time probabilistic counting algorithm for database applications , 1990, TODS.

[8]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[9]  Peter J. Haas,et al.  Automated Statistics Collection in DB2 UDB , 2004, VLDB.

[10]  Philippe Flajolet,et al.  Probabilistic Counting Algorithms for Data Base Applications , 1985, J. Comput. Syst. Sci..

[11]  Meikel Pöss,et al.  MUDD: a multi-dimensional data generator , 2004, WOSP '04.

[12]  Wen-Chi Hou,et al.  On Estimating COUNT, SUM, and AVERAGE , 1991, International Conference on Database and Expert Systems Applications.

[13]  Wen-Chi Hou,et al.  Error-constrained COUNT query evaluation in relational databases , 1991, SIGMOD '91.

[14]  Wen-Chi Hou,et al.  Processing aggregate relational queries with hard time constraints , 1989, SIGMOD '89.

[15]  Phillip B. Gibbons Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[16]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[17]  Peter J. Haas,et al.  A dip in the reservoir: maintaining sample synopses of evolving datasets , 2006, VLDB.

[18]  Raghunath Othayoth Nambiar,et al.  The making of TPC-DS , 2006, VLDB.

[19]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..