Entropy-based approximate querying and exploration of datacubes

Much research has been devoted to the efficient computation of relational aggregations, and specifically to the efficient execution of the datacube operation. We consider the inverse problem: deriving (approximately) the original data from the aggregates. We motivate this problem in the context of two application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We also describe an alternative use of the proposed framework, which enables us to identify values that deviate from the underlying data distribution and is suitable for data mining purposes. Finally, we present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions.
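As a rough illustration of the inverse problem, the sketch below estimates the cells of a two-dimensional table from its row and column totals alone, and then reports how far the actual cells deviate from that estimate. It uses iterative proportional fitting, a standard way to obtain the maximum-entropy table consistent with known marginals. This is an assumption made for illustration: the abstract does not specify the algorithms used, and the function names `ipf_2d` and `deviations` as well as the sample numbers are hypothetical.

```python
def ipf_2d(row_sums, col_sums, iters=100, tol=1e-9):
    """Estimate a 2-D table from its row and column totals.

    Illustrative sketch only (not necessarily the paper's method):
    iterative proportional fitting converges to the maximum-entropy
    table that matches the given marginal totals.
    """
    n_rows, n_cols = len(row_sums), len(col_sums)
    # Start from a uniform table, the least informative initial guess.
    est = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(iters):
        # Rescale each row so that it matches its known total.
        for i in range(n_rows):
            s = sum(est[i])
            if s > 0:
                factor = row_sums[i] / s
                est[i] = [v * factor for v in est[i]]
        # Rescale each column so that it matches its known total.
        for j in range(n_cols):
            s = sum(est[i][j] for i in range(n_rows))
            if s > 0:
                factor = col_sums[j] / s
                for i in range(n_rows):
                    est[i][j] *= factor
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(est[i]) - row_sums[i]) for i in range(n_rows))
        if err < tol:
            break
    return est


def deviations(actual, est):
    """Cell-wise difference between the actual and the estimated values."""
    return [[a - e for a, e in zip(row_a, row_e)]
            for row_a, row_e in zip(actual, est)]


# Hypothetical base data; only its aggregates are given to the estimator.
actual = [[10.0, 20.0], [30.0, 40.0]]
row_totals = [sum(row) for row in actual]
col_totals = [sum(col) for col in zip(*actual)]

estimate = ipf_2d(row_totals, col_totals)
residuals = deviations(actual, estimate)
```

The estimate can stand in for the base data when answering approximate queries, while cells with large residuals are candidates for values that deviate from the distribution implied by the aggregates.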
