Smoothing over Summary Information in Data Cubes

Decision support usually requires drawing from a huge data warehouse statistical information that is interesting and useful to its users. A typical data model that supports the data warehouse is the multidimensional database, also known as a data cube. A data cube contains cells, each of which is associated with some summary information, or aggregate, on which decisions are to be based. However, in real-life databases, due to the nature of their contents, the data distribution tends to be clustered and sparse. In general, the sparsity gets worse as the number of cells increases. Cells whose support levels fall below a certain threshold must be combined with adjacent cells to acquire sufficient support; otherwise, incomplete or biased results could be derived. Our main focus in this paper is to find approximations for the missing or biased aggregates of those cells that have missing or low support. We call this approximation process smoothing. We propose a smoothing function that smooths well over a quantitative attribute while still preserving local structure. Our method is also adaptive to sudden changes in the data distribution, called discontinuities, that inevitably occur in real-life data.
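
The idea of combining a low-support cell with adjacent cells can be illustrated with a minimal sketch. This is not the smoothing function proposed in the paper: it is a plain window-widening average over a single quantitative attribute, and the function name `smooth_low_support`, the per-cell `sums` and `counts` representation, and the `min_support` parameter are all assumptions made for illustration. It also makes no attempt to detect discontinuities.

```python
from typing import List, Optional


def smooth_low_support(
    sums: List[float], counts: List[int], min_support: int
) -> List[Optional[float]]:
    """Return one average per cell along a single ordered attribute.

    Hypothetical illustration only: a cell whose count is below
    min_support borrows from its neighbours by widening a window
    around it until the combined count reaches the threshold.
    """
    n = len(sums)
    result: List[Optional[float]] = []
    for i in range(n):
        lo, hi = i, i
        total_sum, total_count = sums[i], counts[i]
        # Widen the window one cell in each direction per step
        # until support is sufficient or the whole axis is covered.
        while total_count < min_support and (lo > 0 or hi < n - 1):
            if lo > 0:
                lo -= 1
                total_sum += sums[lo]
                total_count += counts[lo]
            if hi < n - 1:
                hi += 1
                total_sum += sums[hi]
                total_count += counts[hi]
        # A cell with no data anywhere in reach stays undefined.
        result.append(total_sum / total_count if total_count > 0 else None)
    return result


# The middle cell is empty, so it is estimated from both neighbours.
averages = smooth_low_support([10.0, 0.0, 30.0], [5, 0, 5], min_support=3)
print(averages)
```

Because the window ignores where the distribution changes abruptly, a sketch like this would blur exactly the discontinuities that the abstract says the proposed method adapts to.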
