Statistics Collection in Oracle Spatial and Graph: Fast Histogram Construction for Complex Geometry Objects

Oracle Spatial and Graph is a geographic information system (GIS) which provides users the ability to store spatial data alongside conventional data in Oracle. As a result of the coexistence of spatial and other data, we observe a trend towards users performing increasingly complex queries which involve spatial as well as non-spatial predicates. Accurate selectivity values, especially for queries with multiple predicates requiring joins among numerous tables, are essential for the database optimizer to determine a good execution plan. For queries involving spatial predicates, this requires that reasonably accurate statistics collection has been performed on the spatial data. For extensible data cartridges such as Oracle Spatial and Graph, the optimizer expects to receive accurate predicate selectivity and cost values from functions implemented within the data cartridge. Although statistics collection for spatial data has been researched in academia for a few years; to the best of our knowledge, this is the first work to present spatial statistics collection implementation details for a commercial GIS database. In this paper, we describe our experiences with implementation of statistics collection methods for complex geometry objects within Oracle Spatial and Graph. Firstly, we exemplify issues with previous partitioning-based algorithms in presence of complex geometry objects and suggest enhancements which resolve the issues. Secondly, we propose a main memory implementation which not only speeds up the disk-based partitioning algorithms but also utilizes existing R-tree indexes to provide surprisingly accurate selectivity estimates. Last but not the least, we provide extensive experimental results and an example study which displays the efficacy of our approach on Oracle query performance.

[1]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[2]  Gerardo Beni,et al.  A Validity Measure for Fuzzy Clustering , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[4]  B Beauregard,et al.  Oracle Spatial and Graph , 2014 .

[5]  Benoît Dageville,et al.  Efficient and scalable statistics gathering for large databases in Oracle 11g , 2008, SIGMOD Conference.

[6]  Torsten Suel,et al.  On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications , 1999, ICDT.

[7]  Jeffrey F. Naughton,et al.  Sampling-Based Estimation of the Number of Distinct Values of an Attribute , 1995, VLDB.

[8]  Jeffrey F. Naughton,et al.  Accurate estimation of the cost of spatial selections , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[9]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[10]  Todd Eavis,et al.  Rk-hist: an r-tree based histogram for multi-dimensional selectivity estimation , 2007, CIKM '07.

[11]  Yannis E. Ioannidis,et al.  The History of Histograms (abridged) , 2003, VLDB.

[12]  Torsten Suel,et al.  Optimal Histograms with Quality Guarantees , 1998, VLDB.

[13]  Christos Faloutsos,et al.  On packing R-trees , 1993, CIKM '93.

[14]  Paul M. Aoki How to avoid building DataBlades(R) that know the value of everything and the cost of nothing , 1999, Proceedings. Eleventh International Conference on Scientific and Statistical Database Management.

[15]  Dimitrios Gunopulos,et al.  Approximating multi-dimensional aggregate range queries over real attributes , 2000, SIGMOD '00.

[16]  Sridhar Ramaswamy,et al.  Selectivity estimation in spatial databases , 1999, SIGMOD '99.

[17]  Kothuri Venkata Ravi Kanth,et al.  Quadtree and R-tree indexes in oracle spatial: a comparison using GIS data , 2002, SIGMOD '02.

[18]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[19]  Hans-Peter Kriegel,et al.  A Storage and Access Architecture for Efficient Query Processing in Spatial Database Systems , 1993, SSD.

[20]  Donald Gustafson,et al.  Fuzzy clustering with a fuzzy covariance matrix , 1978, 1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes.

[21]  Ning An,et al.  Incorporating updates in domain indexes: experiences with Oracle spatial R-trees , 2004, Proceedings. 20th International Conference on Data Engineering.

[22]  Jay Banerjee,et al.  Indexing medium-dimensionality data in Oracle , 1999, SIGMOD '99.

[23]  Christos Faloutsos,et al.  The R+-Tree: A Dynamic Index for Multi-Dimensional Objects , 1987, VLDB.

[24]  Peter J. Haas,et al.  Automated Statistics Collection in DB2 UDB , 2004, VLDB.

[25]  Peter J. Haas,et al.  ISOMER: Consistent Histogram Construction Using Query Feedback , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[26]  Yon Dohn Chung,et al.  Hierarchically organized skew-tolerant histograms for geographic data objects , 2010, SIGMOD Conference.

[27]  Phillip B. Gibbons Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[28]  Jeffrey F. Naughton,et al.  Practical selectivity estimation through adaptive sampling , 1990, SIGMOD '90.

[29]  P.J. Haas,et al.  Sampling-based selectivity estimation for joins using augmented frequent value statistics , 1995, Proceedings of the Eleventh International Conference on Data Engineering.