Bayesian networks for supporting query processing over incomplete autonomous databases

As the information available to naïve users through autonomous data sources continues to increase, mediators become important for ensuring that this wealth of information is tapped effectively. A key challenge that these information mediators need to handle is the varying level of incompleteness in the underlying databases, in the form of missing attribute values. Existing approaches such as QPIAD mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobble their performance when tuples contain missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves learning Bayesian networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results that are incomplete on the query-constrained attributes, when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy and (ii) the relevant possible answers retrieved by queries reformulated using Bayesian networks have higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.
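To make the imputation idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical three-attribute car table and a hand-fixed network structure in which `model` is the parent of `make` and `body`; the sample data, the attribute names, and the functions `learn` and `posterior_model` are all illustrative assumptions. It estimates the conditional probability tables from a complete sample and returns the posterior over a missing `model` value given whichever other attributes are observed.

```python
from collections import Counter

# Hypothetical complete sample of (model, make, body) tuples.
SAMPLE = [
    ("Civic", "Honda", "Sedan"),
    ("Civic", "Honda", "Sedan"),
    ("Civic", "Honda", "Coupe"),
    ("Accord", "Honda", "Sedan"),
    ("Accord", "Honda", "Sedan"),
    ("F150", "Ford", "Truck"),
    ("F150", "Ford", "Truck"),
    ("Focus", "Ford", "Sedan"),
]


def learn(sample, alpha=1.0):
    """Estimate P(model), P(make | model), P(body | model).

    Assumed structure: model -> make, model -> body.
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    models = sorted({r[0] for r in sample})
    makes = sorted({r[1] for r in sample})
    bodies = sorted({r[2] for r in sample})
    n_model = Counter(r[0] for r in sample)
    n_make = Counter((r[0], r[1]) for r in sample)
    n_body = Counter((r[0], r[2]) for r in sample)
    total = len(sample)

    prior = {
        m: (n_model[m] + alpha) / (total + alpha * len(models))
        for m in models
    }
    p_make = {
        (m, k): (n_make[(m, k)] + alpha) / (n_model[m] + alpha * len(makes))
        for m in models for k in makes
    }
    p_body = {
        (m, b): (n_body[(m, b)] + alpha) / (n_model[m] + alpha * len(bodies))
        for m in models for b in bodies
    }
    return prior, p_make, p_body


def posterior_model(net, make=None, body=None):
    """Return P(model | observed attributes) as a dict.

    An attribute passed as None is treated as missing; because it is a
    child of model, summing it out contributes a factor of 1.
    """
    prior, p_make, p_body = net
    scores = {}
    for m, p in prior.items():
        if make is not None:
            p *= p_make[(m, make)]
        if body is not None:
            p *= p_body[(m, body)]
        scores[m] = p
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}


if __name__ == "__main__":
    net = learn(SAMPLE)
    # Tuple with model missing, make and body observed.
    post = posterior_model(net, make="Honda", body="Coupe")
    print(max(post, key=post.get), post)
    # Tuple with both model and body missing.
    post = posterior_model(net, make="Ford")
    print(max(post, key=post.get), post)
```

Imputation takes the most probable value of the posterior. The same posterior could in principle be used to rank candidate rewritten queries by how likely an incomplete tuple is to satisfy the original constraint, but that step is not shown here.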
