Knowledge expansion over probabilistic knowledge bases

Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. Our contributions toward scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing (MPP) databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, erroneous facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
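
To illustrate what applying inference rules in batches with SQL can look like, the following is a minimal sketch and not the ProbKB implementation. It assumes a hypothetical schema in which facts are stored as (rel, x, y) triples and all Horn rules of the shape head(x, z) <- body1(x, y), body2(y, z) are stored as rows of a rules table, so that a single join applies every such rule at once. The table names, column names, and sample data are assumptions made for illustration; probabilities (rule weights and fact confidences) and quality control are omitted entirely. SQLite stands in for the relational DBMS.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a few hypothetical facts and rules."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE facts (rel TEXT, x TEXT, y TEXT, PRIMARY KEY (rel, x, y));
        -- Each row encodes one rule: head(x, z) <- body1(x, y), body2(y, z)
        CREATE TABLE rules (head TEXT, body1 TEXT, body2 TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?)",
        [
            ("born_in", "Ann", "Gainesville"),
            ("located_in", "Gainesville", "Florida"),
            ("located_in", "Florida", "USA"),
        ],
    )
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?, ?)",
        [
            ("born_in", "born_in", "located_in"),
            ("located_in", "located_in", "located_in"),
        ],
    )
    return conn


def count_facts(conn):
    return conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]


def expand_once(conn):
    """Apply all stored rules in one join; return the number of new facts."""
    before = count_facts(conn)
    conn.execute(
        """
        INSERT OR IGNORE INTO facts (rel, x, y)
        SELECT r.head, f1.x, f2.y
        FROM rules r
        JOIN facts f1 ON f1.rel = r.body1
        JOIN facts f2 ON f2.rel = r.body2 AND f2.x = f1.y
        """
    )
    return count_facts(conn) - before


def expand_to_fixpoint(conn):
    """Repeat the batch step until no new facts are derived."""
    rounds = 0
    while expand_once(conn) > 0:
        rounds += 1
    return rounds


if __name__ == "__main__":
    conn = build_demo()
    expand_to_fixpoint(conn)
    for row in conn.execute("SELECT rel, x, y FROM facts ORDER BY rel, x, y"):
        print(row)
```

The point of the design is that the cost of one expansion step is a fixed number of joins per rule shape rather than one query per rule, which is what makes the approach a natural fit for a relational engine. The primary key on (rel, x, y) together with INSERT OR IGNORE keeps already-known facts from being re-inserted, so the loop stops once a round derives nothing new.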
