QPIAD: Query Processing over Incomplete Autonomous Databases

Incompleteness due to missing attribute values (also known as "null values") is very common in autonomous Web databases, which users usually access through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if those tuples turn out to be relevant to a user query. Ideally, we would like the mediator to retrieve such relevant uncertain answers and gauge their relevance by assessing their likelihood of being relevant answers to the query. However, the autonomous nature of the databases poses several challenges, such as restricted access privileges, limited query patterns, and the need to limit database and network resource consumption in the Web environment. We introduce a novel query rewriting and optimization framework, QPIAD, that tackles these challenges to retrieve relevant uncertain answers. Our technique involves reformulating the user query based on approximate functional dependencies (AFDs) among the database attributes and ranking the reformulated queries using value distributions learned from naive Bayes classifiers. Empirical studies demonstrate the effectiveness of our approach in retrieving relevant uncertain answers with high precision, high recall, and manageable cost.
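
The rewriting idea described above can be illustrated with a minimal sketch. It is not the QPIAD implementation: the toy relation, the assumed AFD (model determines body_style approximately), and all function names are hypothetical, and a simple conditional frequency stands in for the naive Bayes classifier mentioned in the abstract. The sketch also estimates probabilities from the queried table itself, whereas a mediator would more plausibly learn them from a separate sample.

```python
# Hypothetical toy relation; None marks a missing (null) attribute value.
CARS = [
    {"id": 1, "model": "Z4", "body_style": "Convt"},
    {"id": 2, "model": "Z4", "body_style": "Convt"},
    {"id": 3, "model": "Z4", "body_style": None},
    {"id": 4, "model": "A4", "body_style": "Sedan"},
    {"id": 5, "model": "A4", "body_style": "Convt"},
    {"id": 6, "model": "A4", "body_style": None},
    {"id": 7, "model": "Civic", "body_style": "Sedan"},
    {"id": 8, "model": "Civic", "body_style": None},
]


def select(rows, **conditions):
    """Equality selection; a condition value of None matches nulls."""
    return [r for r in rows if all(r[a] == v for a, v in conditions.items())]


def estimate(sample, determining, target, det_value, target_value):
    """Estimate P(target = target_value | determining = det_value).

    A plain relative frequency over tuples whose target is known;
    this is a simplification standing in for a learned classifier.
    """
    known = [
        r for r in sample
        if r[determining] == det_value and r[target] is not None
    ]
    if not known:
        return 0.0
    return sum(r[target] == target_value for r in known) / len(known)


def answer(rows, target, target_value, determining):
    """Return (certain answers, ranked uncertain answers)."""
    # Step 1: the original query returns the certain answers.
    certain = select(rows, **{target: target_value})
    # Step 2: use the assumed AFD determining -> target to build one
    # rewritten query per determining value seen in the certain answers.
    det_values = {r[determining] for r in certain}
    # Step 3: order the rewritten queries by estimated probability.
    ranked = sorted(
        ((estimate(rows, determining, target, v, target_value), v)
         for v in det_values),
        reverse=True,
    )
    # Step 4: each rewritten query fetches tuples with a null target.
    uncertain = []
    for prob, v in ranked:
        for r in select(rows, **{determining: v, target: None}):
            uncertain.append((r["id"], prob))
    return certain, uncertain


certain, uncertain = answer(CARS, "body_style", "Convt", "model")
print([r["id"] for r in certain])
print(uncertain)
```

In this sketch, only tuples whose model also appears among the certain answers are fetched, so the Civic tuple with a missing body style is never retrieved. That mirrors the motivation of issuing targeted rewritten queries rather than fetching every tuple with a null value.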
