Paper Information - Aggregate Query Processing on Incomplete Data

Aggregate Query Processing on Incomplete Data

Incomplete data has been a longstanding issue in the database community, and yet the subject is poorly handled in both theory and practice. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than imputing the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. Experimental results are consistent with the theoretical results, and suggest that the estimate is invaluable for better assessing the results of aggregate queries on incomplete data.
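To make the idea of an interval over all possible interpretations concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes that every missing value (represented here as `None`) is known to lie in a closed domain `[lo, hi]`; the function names `sum_bounds` and `avg_bounds` and the domain assumption are illustrative and do not come from the abstract.

```python
from typing import Optional, Sequence, Tuple


def sum_bounds(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Lower and upper bounds of SUM over all completions of the data.

    Assumption (not from the paper): each missing value, encoded as None,
    may take any value in the closed interval [lo, hi].
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    known_total = sum(v for v in values if v is not None)
    missing_count = sum(1 for v in values if v is None)
    # The smallest sum sets every missing value to lo, the largest to hi.
    return known_total + missing_count * lo, known_total + missing_count * hi


def avg_bounds(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Bounds of AVG over all rows, counting rows with missing values."""
    if not values:
        raise ValueError("AVG is undefined on an empty column")
    s_lo, s_hi = sum_bounds(values, lo, hi)
    n = len(values)
    return s_lo / n, s_hi / n


# Hypothetical column with two missing entries and an assumed domain [20, 60].
column = [30, None, 50, None]
print(sum_bounds(column, 20, 60))
print(avg_bounds(column, 20, 60))
```

Under the stated domain assumption, any true completion of the column yields a SUM and an AVG inside the returned intervals, which is the kind of guarantee the abstract describes. Aggregates with selection predicates or grouping would need more careful treatment than this sketch provides.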
