Data requirements in statistical decision support systems: formulation and some results in choosing summaries

We study the problem of determining data requirements when answers to statistical queries are desired. Specifically, we consider the value of storing aggregate data that can speed up answering such queries, at the potential cost of incomplete information, due to either estimation error or staleness, and of increased update costs. We formulate the overall design optimization problem and decompose it into several subproblems that can be addressed separately. Two of these subproblems are the choice of update method and the choice of aggregates. We give qualitative results on the selection of update policy and, based on numerical experiments, design heuristics for single-attribute Legendre polynomial aggregates. Multivariate Legendre aggregates are also discussed, and suggestions for future research are given.