Sample-based quality estimation of query results in relational database environments

The quality of data in relational databases is often uncertain, and the relationship between the quality of the underlying base tables and the quality of the query results that could be produced from them, a type of information product (IP), has not been fully investigated. This paper provides a basis for the systematic analysis of the quality of such IPs. We use the relational algebra framework to develop estimates for the quality of query results based on quality estimates of samples taken from the base tables. Our procedure requires an initial sample from each base table; these samples are then used for all possible IPs. Each specific query governs the quality assessment of the relevant samples. Because the same samples are used repeatedly, our approach is relatively cost-effective. We introduce the reference-table procedure, which can be used for quality estimation in general. In addition, for each of the basic algebraic operators, we discuss simpler procedures that may be applicable. Special attention is devoted to the join operation. We examine various relevant statistical issues, including how to deal with the impact of missing rows in base tables on quality. Finally, we address several implementation issues related to sampling.
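The idea of reusing one base-table sample across queries can be illustrated with a minimal sketch for a selection query. This is not the paper's reference-table procedure; it is a simplified illustration under stated assumptions. The function names `draw_sample` and `estimate_quality`, the row-level quality check `is_acceptable`, and the normal-approximation confidence interval are all hypothetical choices made for this example.

```python
import math
import random


def draw_sample(table, n, seed=0):
    """Draw a simple random sample of n rows (without replacement) from a base table."""
    rng = random.Random(seed)
    return rng.sample(table, n)


def estimate_quality(sample, predicate, is_acceptable, z=1.96):
    """Estimate the proportion of acceptable rows in a selection query result.

    sample        -- rows sampled once from the base table and reused for every query
    predicate     -- the selection condition of the query (assumed callable on a row)
    is_acceptable -- hypothetical row-level quality check (e.g. a manual audit result)
    z             -- normal quantile for the interval (1.96 gives roughly 95%)

    Returns (estimate, lower, upper, rows_used), or None if no sampled row
    satisfies the predicate.
    """
    # The query decides which sampled rows are relevant to its result.
    relevant = [row for row in sample if predicate(row)]
    m = len(relevant)
    if m == 0:
        return None
    p = sum(1 for row in relevant if is_acceptable(row)) / m
    # Normal-approximation interval; an assumption, and unreliable for small m.
    half_width = z * math.sqrt(p * (1 - p) / m)
    return p, max(0.0, p - half_width), min(1.0, p + half_width), m


if __name__ == "__main__":
    # Synthetic base table; the "ok" flag stands in for an audited quality verdict.
    base_table = [
        {"id": i, "region": "east" if i % 2 == 0 else "west", "ok": i % 10 != 0}
        for i in range(1000)
    ]
    sample = draw_sample(base_table, 200)
    # The same sample serves two different selection queries.
    print(estimate_quality(sample, lambda r: r["region"] == "east", lambda r: r["ok"]))
    print(estimate_quality(sample, lambda r: r["id"] < 300, lambda r: r["ok"]))
```

In this sketch the sampling cost is paid once, and each query only filters the stored sample, which mirrors the cost argument made above. Operators such as join, and the treatment of rows missing from the base tables, would need more care than this selection-only example provides.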