How to juggle columns: an entropy-based approach for table compression

Many relational databases exhibit complex dependencies between data attributes, caused either by the nature of the underlying data or by explicitly denormalized schemas. In data warehouse scenarios, calculated key figures may be materialized or hierarchy levels may be held within a single dimension table. Such column correlations and the resulting data redundancy may result in additional storage requirements. They may also result in bad query performance if inappropriate independence assumptions are made during query compilation. In this paper, we tackle the specific problem of detecting functional dependencies between columns to improve the compression rate for column-based database systems, which both reduces main memory consumption and improves query performance. Although a huge variety of algorithms have been proposed for detecting column dependencies in databases, we maintain that increased data volumes and recent developments in hardware architectures demand novel algorithms with much lower runtime overhead and smaller memory footprint. Our novel approach is based on entropy estimations and exploits a combination of sampling and multiple heuristics to render it applicable for a wide range of use cases. We demonstrate the quality of our approach by means of an implementation within the SAP NetWeaver Business Warehouse Accelerator. Our experiments indicate that our approach scales well with the number of columns and produces reliable dependence structure information. This both reduces memory consumption and improves performance for nontrivial queries.

[1]  Paul Brown,et al.  BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data , 2003, VLDB.

[2]  D. Horvitz,et al.  A Generalization of Sampling Without Replacement from a Finite Universe , 1952 .

[3]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[4]  Hannu Toivonen,et al.  Efficient discovery of functional and approximate dependencies using partitions , 1998, Proceedings 14th International Conference on Data Engineering.

[5]  Luis Gravano,et al.  STHoles: a multidimensional workload-aware histogram , 2001, SIGMOD '01.

[6]  Bin Yu,et al.  Coverage-adjusted entropy estimation. , 2007, Statistics in medicine.

[7]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[8]  Surajit Chaudhuri,et al.  Exploiting statistics on query expressions for optimization , 2002, SIGMOD '02.

[9]  Weiru Liu,et al.  Learning belief networks from data: an information theory based approach , 1997, CIKM '97.

[10]  S. Kullback,et al.  Information Theory and Statistics , 1959 .

[11]  Jeffrey Scott Vitter,et al.  SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads , 2003, VLDB.

[12]  Jean-François Boulicaut,et al.  Towards the reverse engineering of renormalized relational databases , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[13]  Henri Theil,et al.  On the use of Information Theory Concepts in the Analysis of Financial Statements , 1969 .

[14]  Heikki Mannila,et al.  Dependency Inference , 1987, VLDB.

[15]  A. Chao,et al.  Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample , 2004, Environmental and Ecological Statistics.

[16]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .