Statisticum: Data Statistics Management in SAP HANA

We introduce a new concept of leveraging traditional data statistics as dynamic data integrity constraints. These data statistics produce transient database constraints, which are valid as long as they can be proven to be consistent with the current data. We denote this type of data statistics by constraint data statistics, their properties needed for consistency checking by consistency metadata, and their implied integrity constraints by implied data statistics constraints (implied constraints for short). Implied constraints are valid integrity constraints which are powerful query optimization tools employed, just as traditional database constraints, in semantic query transformation (aka query reformulation), partition pruning, runtime optimization, and semi-join reduction, to name a few. To our knowledge, this is the first work introducing this novel and powerful concept of deriving implied integrity constraints from data statistics. We discuss theoretical aspects of the constraint data statistics concept and their integration into query processing. We present the current architecture of data statistics management in SAP HANA and detail how constraint data statistics are designed and integrated into this architecture. As an instantiation of this framework, we consider dynamic partition pruning for data aging scenarios. We discuss our current implementation for constraint data statistics objects in SAP HANA which can be used for dynamic partition pruning. We enumerate their properties and show how consistency checking for implied integrity constraints is supported in the data statistics architecture. Our experimental evaluations on the TPC-H benchmark and a real customer application confirm the effectiveness of the implied integrity constraints; (1) for 59% of TPC-H queries, constraint data statistics utilization results in pruning cold partitions and reducing memory consumption, and (2) we observe up to 3 orders of magnitude speed-up in query processing time, for a real customer running an S/4HANA application.

[1]  Guido Moerkotte,et al.  Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing , 1998, VLDB.

[2]  Anurag Gupta,et al.  Amazon Redshift and the Case for Simpler Data Warehouses , 2015, SIGMOD Conference.

[3]  Alister D'Costa,et al.  Data Statistics Adviser in Database Management Systems , 2017, SIGMOD 2017.

[4]  Norman May,et al.  Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA , 2014, SIGMOD Conference.

[5]  Paul Brown,et al.  BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data , 2003, VLDB.

[6]  Laks V. S. Lakshmanan,et al.  Discovering Conditional Functional Dependencies , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[7]  Jignesh M. Patel,et al.  Data Morphing: An Adaptive, Cache-Conscious Storage Technique , 2003, VLDB.

[8]  Volker Markl,et al.  Collecting and Maintaining Just-in-Time Statistics , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[9]  Surajit Chaudhuri,et al.  Automating Statistics Management for Query Optimizers , 2001, IEEE Trans. Knowl. Data Eng..

[10]  Alexander Zeier,et al.  HYRISE - A Main Memory Hybrid Storage Engine , 2010, Proc. VLDB Endow..

[11]  P. Patel-Schneider Towards Large-scale Schema And Ontology Matching , 2015 .

[12]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[13]  Wenfei Fan,et al.  Dependencies revisited for improving data quality , 2008, PODS.

[14]  Liwen Sun,et al.  Fine-grained partitioning for aggressive data skipping , 2014, SIGMOD Conference.

[15]  Peter Bumbulis,et al.  Page As You Go: Piecewise Columnar Access In SAP HANA , 2016, SIGMOD Conference.

[16]  Badrish Chandramouli,et al.  READY: Completeness is in the Eye of the Beholder , 2017, CIDR.

[17]  Ashish Motivala,et al.  The Snowflake Elastic Data Warehouse , 2016, SIGMOD Conference.

[18]  Luis Carlos Erpen De Bona,et al.  Cubrick: Indexing Millions of Records per Second for Interactive Analytics , 2016, Proc. VLDB Endow..

[19]  Donald Kossmann,et al.  Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia , 2013, Proc. VLDB Endow..

[20]  Liwen Sun,et al.  Skipping-oriented Partitioning for Columnar Layouts , 2016, Proc. VLDB Endow..

[21]  Alin Deutsch,et al.  Query reformulation with constraints , 2006, SGMD.

[22]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[23]  Shamkant B. Navathe,et al.  Vertical partitioning algorithms for database design , 1984, TODS.

[24]  Herodotos Herodotou,et al.  Query optimization techniques for partitioned tables , 2011, SIGMOD '11.

[25]  Goetz Graefe Fast Loads and Fast Queries , 2009, DaWaK.

[26]  Norman May,et al.  The SAP HANA Database -- An Architecture Overview , 2012, IEEE Data Eng. Bull..

[27]  Carlos Eduardo Scheidegger,et al.  Nanocubes for Real-Time Exploration of Spatiotemporal Datasets , 2013, IEEE Transactions on Visualization and Computer Graphics.