Query Processing over Incomplete Autonomous Databases

Incompleteness due to missing attribute values (aka "null values") is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by accessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework QPIAD that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naive Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.

[1]  Subbarao Kambhampati,et al.  Answering Imprecise Queries over Autonomous Web Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[2]  Chian-Huei Wun,et al.  Using association rules for completing missing data , 2004, Fourth International Conference on Hybrid Intelligent Systems (HIS'04).

[3]  Dan Suciu,et al.  Foundations of probabilistic answers to queries , 2005, SIGMOD '05.

[4]  Alon Y. Halevy,et al.  Adapting to source properties in processing data integration queries , 2004, SIGMOD '04.

[6]  Leopoldo E. Bertossi,et al.  Consistent query answering in databases , 2006, SGMD.

[7]  Wenfei Fan,et al.  Conditional Functional Dependencies for Data Cleaning , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[8]  Jean-Marc Petit,et al.  Functional and approximate dependency mining: database and FCA points of view , 2002, J. Exp. Theor. Artif. Intell..

[9]  Rosine Cicchetti,et al.  FUN: An Efficient Algorithm for Mining Functional and Embedded Dependencies , 2001, ICDT.

[10]  Tomasz Imielinski,et al.  Incomplete Information in Relational Databases , 1984, JACM.

[11]  Wenfei Fan,et al.  Dependencies revisited for improving data quality , 2008, PODS.

[12]  C. Koch,et al.  Worlds and Beyond : Effcient Representation and Processing of Incomplete Information , 2007 .

[13]  Gustavo E. A. P. A. Batista,et al.  A Study of K-Nearest Neighbour as an Imputation Method , 2002, HIS.

[14]  D. Rubin,et al.  Statistical Analysis with Missing Data. , 1989 .

[15]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[16]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[17]  Gultekin Özsoyoglu,et al.  Incomplete Relational Database Models Based on Intervals , 1993, IEEE Trans. Knowl. Data Eng..

[18]  Val Tannen,et al.  Models for Incomplete and Probabilistic Information , 2006, IEEE Data Eng. Bull..

[19]  Leonid Libkin,et al.  Data exchange and incomplete information , 2006, PODS '06.

[20]  David Heckerman,et al.  A Tutorial on Learning with Bayesian Networks , 1999, Innovations in Bayesian Networks.

[21]  Paola Sebastiani,et al.  Robust Learning with Missing Data , 2001, Machine Learning.

[22]  Witold Lipski,et al.  On semantic issues connected with incomplete information databases , 1979, ACM Trans. Database Syst..

[23]  Surajit Chaudhuri,et al.  Leveraging aggregate constraints for deduplication , 2007, SIGMOD '07.

[24]  Hemal Khatri QUERY PROCESSING OVER INCOMPLETE AUTONOMOUS WEB DATABASES , 2006 .

[25]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[26]  Dan Olteanu,et al.  $${10^{(10^{6})}}$$ worlds and beyond: efficient representation and processing of incomplete information , 2006, 2007 IEEE 23rd International Conference on Data Engineering.

[27]  Jennifer Widom,et al.  Trio: A System for Integrated Management of Data, Accuracy, and Lineage , 2004, CIDR.

[28]  Nicole A. Lazar,et al.  Statistical Analysis With Missing Data , 2003, Technometrics.

[29]  Subbarao Kambhampati,et al.  QPIAD: Query Processing over Incomplete Autonomous Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30]  Jayant Madhavan,et al.  Structured Data Meets the Web: A Few Observations , 2006, IEEE Data Eng. Bull..

[31]  Hannu Toivonen,et al.  TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies , 1999, Comput. J..

[32]  Jean-Marc Petit,et al.  Efficient Discovery of Functional Dependencies and Armstrong Relations , 2000, EDBT.

[33]  Walid G. Aref,et al.  Supporting top-kjoin queries in relational databases , 2004, The VLDB Journal.

[34]  Howard J. Hamilton,et al.  Mining functional dependencies from data , 2007, Data Mining and Knowledge Discovery.

[35]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[36]  Dan Suciu,et al.  Management of probabilistic data: foundations and challenges , 2007, PODS '07.

[37]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[38]  Ion Muslea,et al.  Online Query Relaxation via Bayesian Causal Structures Discovery , 2005, AAAI.

[39]  Edward L. Robertson,et al.  FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances - Extended Abstract , 2001, DaWaK.

[40]  Bei Yu,et al.  On generating near-optimal tableaux for conditional functional dependencies , 2008, Proc. VLDB Endow..

[41]  Dan Olteanu,et al.  10106 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information , 2007, ICDE.

[42]  Hannu Toivonen,et al.  Efficient discovery of functional and approximate dependencies using partitions , 1998, Proceedings 14th International Conference on Data Engineering.

[43]  Val Tannen,et al.  An Equational Chase for Path-Conjunctive Queries, Constraints, and Views , 1999, ICDT.

[44]  Heikki Mannila,et al.  Approximate Dependency Inference from Relations , 1992, ICDT.

[45]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[46]  Rahul Gupta,et al.  Creating probabilistic databases from information extraction models , 2006, VLDB.

[47]  Aravind Kalavagattu MINING APPROXIMATE FUNCTIONAL DEPENDENCIES AS CONDENSED REPRESENTATIONS OF ASSOCIATION RULES , 2008 .

[48]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[49]  T. S. Jayram,et al.  OLAP over uncertain and imprecise data , 2007, The VLDB Journal.

[50]  Renée J. Miller,et al.  ConQuer: efficient management of inconsistent databases , 2005, SIGMOD '05.

[51]  Renée J. Miller,et al.  Clean Answers over Dirty Databases: A Probabilistic Approach , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[52]  Maurizio Lenzerini,et al.  Source inconsistency and incompleteness in data integration , 2002, KRDB.

[53]  Pat Langley,et al.  Selection of Relevant Features and Examples in Machine Learning , 1997, Artif. Intell..

[54]  Dan Suciu,et al.  Answering Queries from Statistics and Probabilistic Views , 2005, VLDB.