Tractable Query Answering Under Probabilistic Constraints

Large knowledge bases such as YAGO [SKW07] or DBpedia [BLK+09] can be used to answer queries in various domains. However, as they are automatically harvested from Web sources, they may be incomplete: important facts may be missing because they were not materialized in the original sources, or could not be extracted correctly. To mitigate this problem, approaches such as association rule mining [GTHS13] can extract statistical rules from the data which hold in most situations. For instance, people are usually nationals of the country where they were born; people who died in a place are often buried there. Applying such rules allows us to infer some of the missing facts and thus reduce incompleteness. Hence, we study the problem of query answering on large-scale knowledge bases under the constraints of such probabilistic deduction rules. As such rules only represent statistical tendencies, one needs to keep track of the uncertainty of rule consequences when reasoning about them. There is a large body of work on probabilistic data management [SORK11]; yet, in that setting, many important tasks are intractable. For example, fixed conjunctive queries may be #P-hard [DS07] to evaluate on a probabilistic instance, even in the very simple tuple-independent database (TID) model [LLRS97]. To work around such hardness results, existing work has investigated which query classes are tractable over all data instances, with a complex dichotomy between safe and unsafe queries [DS12]. Yet, there has been no attempt to generalize the observation that query evaluation is tractable, for all queries and for much more expressive query languages, on some instances such as probabilistic XML trees [CKS09]. Our work follows this intuition and revisits the probabilistic inference problem by studying instance classes that ensure tractability.
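To make the TID semantics concrete, the following minimal sketch computes the probability of a Boolean conjunctive query by enumerating all possible worlds. The instance, the relation names R, S, T, and the function names are hypothetical and chosen only for illustration; the enumeration is exponential in the number of facts, which is the naive baseline that the hardness results above refer to, not a method proposed in this work.

```python
from itertools import product

# Hypothetical TID instance: each fact is kept independently
# with the probability given here (all values are assumptions).
tid = {
    ("R", ("a",)): 0.5,
    ("S", ("a", "b")): 0.5,
    ("T", ("b",)): 0.5,
    ("R", ("c",)): 0.5,
    ("S", ("c", "b")): 0.5,
}

def query_holds(world):
    """Boolean conjunctive query: exists x, y with R(x), S(x, y), T(y)."""
    return any(
        ("R", (x,)) in world and ("T", (y,)) in world
        for (rel, args) in world if rel == "S"
        for (x, y) in [args]
    )

def query_probability(tid):
    """Sum the weights of all possible worlds in which the query holds."""
    facts = list(tid)
    total = 0.0
    # One Boolean choice per fact: 2^n possible worlds.
    for choices in product([False, True], repeat=len(facts)):
        weight = 1.0
        world = set()
        for fact, kept in zip(facts, choices):
            weight *= tid[fact] if kept else 1.0 - tid[fact]
            if kept:
                world.add(fact)
        if query_holds(world):
            total += weight
    return total

print(query_probability(tid))
```

On this small instance the result can be checked by hand: T(b) must be present, and at least one of the two R, S pairs must be present, which gives 0.5 × (1 − 0.75²) = 0.21875.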
More precisely, we study complexity as a function of instance treewidth, which is motivated by well-known tractability results on evaluating monadic second-order (MSO) queries on non-probabilistic bounded-treewidth instances [FFG02] and counting queries on bounded-treewidth graphs [ALS91]. This approach is also practically relevant, as the treewidth of real-world data is usually much smaller than its size. We thus show that, for the TID model, MSO query evaluation has linear data complexity if the treewidth of the instance is bounded by a constant. However, the TID model is not sufficient to represent the consequences of uncertain deduction rules: it assumes that all facts are independent, whereas rule application imposes correlations between cause and consequence facts. Correlations are usually represented by probabilistic events shared between multiple facts, yet their presence generally makes it intractable to evaluate even the simplest queries, in both the relational [GT06] and XML [KS11] settings. Nevertheless, we show that query evaluation is tractable if the instance has bounded width under a new notion of tree decomposition that accounts for probabilistic events; intuitively, we enforce their compatibility with the tree structure. This result implies, for example, that it is tractable to evaluate queries on the block-independent disjoint [BGMP92] probabilistic relational model, if the underlying instance has bounded treewidth in the usual sense and if the size of blocks is bounded by a constant. In the XML setting, it implies that query evaluation is tractable whenever there is only a bounded number of relevant events to propagate at any point along the tree. Finally, we turn to our original problem of query evaluation on probabilistic instances under uncertain deduction rules: the goal is to determine the answers of a query on a knowledge base, annotated by

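The remark that rule application correlates cause and consequence facts can be illustrated with a small sketch in which facts are annotated by shared probabilistic events. Everything here is an assumption made for illustration (the event names, their probabilities, the two facts, and the restriction to conjunctions of events); it is not the formalism of this work.

```python
from itertools import product

# Hypothetical independent events and their probabilities.
events = {"e_born": 0.8, "e_rule": 0.9}

# Each fact is present exactly when all of its events are true.
# The consequence fact reuses the event of its cause, so the two
# facts are correlated rather than independent.
facts = {
    ("bornIn", "alice", "france"): {"e_born"},
    ("nationality", "alice", "france"): {"e_born", "e_rule"},
}

def probability(required_facts):
    """Probability that all required facts are present,
    computed by enumerating every valuation of the events."""
    names = list(events)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        valuation = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= events[name] if valuation[name] else 1.0 - events[name]
        if all(all(valuation[e] for e in facts[f]) for f in required_facts):
            total += weight
    return total

born = ("bornIn", "alice", "france")
nat = ("nationality", "alice", "france")

joint = probability([born, nat])                              # about 0.72
independent_guess = probability([born]) * probability([nat])  # about 0.576
print(joint, independent_guess)
```

Treating the two facts as independent, as the TID model would, underestimates the joint probability here, which is why a representation with shared events is needed once rules are applied.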
[1]  Detlef Seese,et al.  Easy Problems for Tree-Decomposable Graphs , 1991, J. Algorithms.

[2]  Fabian M. Suchanek,et al.  AMIE: association rule mining under incomplete evidence in ontological knowledge bases , 2013, WWW.

[3]  Jörg Flum,et al.  Query evaluation via tree-decompositions , 2001, JACM.

[4]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[5]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng.

[6]  Jens Lehmann,et al.  DBpedia - A crystallization point for the Web of Data , 2009, J. Web Semant.

[7]  Yehoshua Sagiv,et al.  Running tree automata on probabilistic XML , 2009, PODS.

[8]  Dan Suciu,et al.  Probabilistic databases , 2011, SIGA.

[9]  Val Tannen,et al.  Models for Incomplete and Probabilistic Information , 2006, IEEE Data Eng. Bull.

[10]  Andrea Calì,et al.  Taming the Infinite Chase: Query Answering under Expressive Relational Constraints , 2008, Description Logics.

[11]  Gerhard Weikum,et al.  YAGO: A Core of Semantic Knowledge , 2007, WWW.

[12]  Adrian Onet,et al.  The Chase Procedure and its Applications in Data Exchange , 2013, Data Exchange, Information, and Streams.

[13]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[14]  Dan Suciu,et al.  The dichotomy of probabilistic inference for unions of conjunctive queries , 2012, JACM.