Introduction and Motivation

Driven by the need to learn from vast amounts of text data, efforts throughout natural language processing, information extraction, databases, and AI are coming together to build large-scale knowledge bases. Academic systems such as NELL [14], Reverb [7], Yago [11], and DeepDive [16] continuously crawl the web to extract relational information. Industry projects such as Microsoft’s Probase [18] or Google’s Knowledge Vault [6] similarly learn structured data from text to improve search products. Notably, such knowledge bases are inherently probabilistic, and many of them [6, 16] are based on the foundations of tuple-independent probabilistic databases (PDBs) [17]. According to the PDB semantics, each database tuple is an independent Bernoulli random variable, and all other tuples have probability zero, enforcing a closed-world assumption (CWA) [15].

This paper revisits the choice of the CWA in probabilistic knowledge bases. We observe that the CWA is violated in their deployment, which makes it problematic to reason, learn, or mine on top of these databases. First, knowledge bases are part of a larger machine learning loop that continuously updates beliefs about facts based on new textual evidence. From a Bayesian learning perspective [2], this loop can only be principled when learned facts have an a priori non-zero probability. Hence, the CWA does not accurately represent this mode of operation and puts it on weak footing. Second, these issues are not temporary: it will never be possible to complete probabilistic knowledge bases of even the most trivial relations, as the memory requirements quickly become excessive. This already manifests today: statistical classifiers output facts at a high rate, but only the most probable ones make it into the knowledge base, and the rest are truncated, losing much of the statistical information.
Third, query answering under the CWA does not take into account the effect the open world can have on the query probability. This makes it impossible to distinguish queries whose probability should intuitively differ. These issues stand in the way of some principled approaches to knowledge base completion and mining. We propose an alternative semantics for probabilistic knowledge bases to address these problems, which results in open-world PDBs (OpenPDBs). We show that OpenPDBs provide more meaningful answers. Finally, we pinpoint limitations of OpenPDBs and discuss ontology-based data access (OBDA) as a promising approach to further strengthen this framework.
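As a rough illustration of the closed-world semantics and of the first and third issues above, the following Python sketch uses a hypothetical two-tuple database. The relation name, constants, probabilities, and helper functions are all assumptions made for illustration; they are not taken from any of the cited systems.

```python
from typing import Dict, Iterable, Tuple

Fact = Tuple[str, ...]  # e.g. ("Acted", "alice", "film1")

# Hypothetical toy PDB: each listed tuple is an independent Bernoulli variable.
pdb: Dict[Fact, float] = {
    ("Acted", "alice", "film1"): 0.9,
    ("Acted", "bob", "film2"): 0.6,
}


def prob_closed_world(db: Dict[Fact, float], fact: Fact) -> float:
    """Under the CWA, any tuple absent from the database has probability zero."""
    return db.get(fact, 0.0)


def prob_conjunction(db: Dict[Fact, float], facts: Iterable[Fact]) -> float:
    """Probability that all given tuples hold, using tuple independence."""
    p = 1.0
    for fact in set(facts):  # de-duplicate so repeated tuples count once
        p *= prob_closed_world(db, fact)
    return p


def bayes_update(prior: float, lik_true: float, lik_false: float) -> float:
    """Posterior P(fact | evidence) computed with Bayes' rule."""
    evidence = lik_true * prior + lik_false * (1.0 - prior)
    return lik_true * prior / evidence if evidence > 0 else 0.0


# Two queries that intuitively differ: q1 is partly supported by a known
# tuple, q2 has no support at all. Under the CWA both evaluate to zero.
missing_a = ("Acted", "carol", "film3")
missing_b = ("Acted", "dave", "film4")
q1 = prob_conjunction(pdb, [("Acted", "alice", "film1"), missing_a])
q2 = prob_conjunction(pdb, [missing_a, missing_b])

# A zero prior can never be revised, however strong the new evidence is,
# whereas a small non-zero prior can.
stuck = bayes_update(prior=0.0, lik_true=0.9, lik_false=0.1)
revised = bayes_update(prior=0.01, lik_true=0.9, lik_false=0.1)
```

In this sketch `q1` and `q2` are both zero, so the closed-world semantics cannot rank them, and `stuck` stays at zero while `revised` moves above its prior of 0.01.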
[1] Georg Gottlob, et al. Query answering under probabilistic uncertainty in Datalog+/− ontologies. Annals of Mathematics and Artificial Intelligence, 2013.
[2] Dan Suciu, et al. The dichotomy of probabilistic inference for unions of conjunctive queries. JACM, 2012.
[3] Christopher De Sa, et al. Incremental Knowledge Base Construction Using DeepDive. Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, 2015.
[4] Xinlei Chen, et al. Never-Ending Learning. ECAI, 2012.
[5] Oren Etzioni, et al. Identifying Relations for Open Information Extraction. EMNLP, 2011.
[6] Gerhard Weikum, et al. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract. IJCAI, 2013.
[7] Dan Olteanu, et al. Dichotomies for Queries with Negation in Probabilistic Databases. TODS, 2016.
[8] Wei Zhang, et al. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. KDD, 2014.
[9] Guy Van den Broeck, et al. Open World Probabilistic Databases (Extended Abstract). Description Logics, 2016.
[10] Jean Christoph Jung, et al. Ontology-Based Access to Probabilistic Data with OWL QL. SEMWEB, 2012.
[11] Guy Van den Broeck, et al. Understanding the Complexity of Lifted Inference and Asymmetric Weighted Model Counting. StarAI@AAAI, 2014.
[12] Rafael Peñaloza, et al. Probabilistic Query Answering in the Bayesian Description Logic BEL. SUM, 2015.
[13] Radford M. Neal. Pattern Recognition and Machine Learning. Technometrics, 2007.
[14] R. Reiter. On Closed World Data Bases. Logic and Data Bases, 1987.
[15] Christopher De Sa, et al. Incremental Knowledge Base Construction Using DeepDive. The VLDB Journal, 2015.
[16] Isaac Levi, et al. The Enterprise Of Knowledge. 1980.
[17] Haixun Wang, et al. Probase: a probabilistic taxonomy for text understanding. SIGMOD Conference, 2012.