Incremental Knowledge Base Construction Using DeepDive

Populating a database with information from unstructured sources—also known as knowledge base construction (KBC)—is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based, respectively, on sampling and variational techniques. We also study the trade-off space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

[1]  Jeffrey D. Ullman,et al.  Principles of Database and Knowledge-Base Systems, Volume II , 1988, Principles of computer science series.

[2]  Jeffrey D. Uuman Principles of database and knowledge- base systems , 1989 .

[3]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[4]  Mark Jerrum,et al.  Polynomial-Time Approximation Algorithms for the Ising Model , 1990, SIAM J. Comput..

[5]  V. S. Subrahmanian,et al.  Maintaining views incrementally , 1993, SIGMOD Conference.

[6]  Simon Kasif,et al.  Logarithmic-Time Updates and Queries in Probabilistic Networks , 1995, UAI.

[7]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.

[8]  Mark Craven,et al.  Constructing Biological Knowledge Bases by Extracting Information from Text Sources , 1999, ISMB.

[9]  Ashish Gupta,et al.  Materialized views: techniques, implementations, and applications , 1999 .

[10]  Hoon Kim,et al.  Monte Carlo Statistical Methods , 2000, Technometrics.

[11]  Nando de Freitas,et al.  An Introduction to MCMC for Machine Learning , 2004, Machine Learning.

[12]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[13]  Georg Gottlob,et al.  The Lixto data extraction project: back and forth between theory and practice , 2004, PODS '04.

[14]  M. Tribus,et al.  Probability theory: the logic of science , 2003 .

[15]  Christian P. Robert,et al.  Monte Carlo Statistical Methods (Springer Texts in Statistics) , 2005 .

[16]  Martin J. Wainwright,et al.  Log-determinant relaxation for approximate inference in discrete Markov random fields , 2006, IEEE Transactions on Signal Processing.

[17]  Jeffrey F. Naughton,et al.  Declarative Information Extraction Using Datalog with Embedded Extraction Predicates , 2007, VLDB.

[18]  Jayant Madhavan,et al.  Web-Scale Data Integration: You can afford to Pay as You Go , 2007, CIDR.

[19]  Jayant Madhavan,et al.  Web-Scale Data Integration: You can afford to Pay as You Go , 2007, CIDR.

[20]  Umut A. Acar,et al.  Adaptive Bayesian inference , 2007, NIPS 2007.

[21]  Razvan C. Bunescu,et al.  Learning to Extract Relations from the Web using Minimal Supervision , 2007, ACL.

[22]  Pedro M. Domingos,et al.  Joint Inference in Information Extraction , 2007, AAAI.

[23]  Oren Etzioni,et al.  TextRunner: Open Information Extraction on the Web , 2007, NAACL.

[24]  Grigorios Tsoumakas,et al.  An adaptive personalized news dissemination system , 2009, Journal of Intelligent Information Systems.

[25]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[26]  Alexandre d'Aspremont,et al.  Model Selection Through Sparse Max Likelihood Estimation Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data , 2022 .

[27]  Umut A. Acar,et al.  Adaptive inference on general graphical models , 2008, UAI.

[28]  Jun Yang,et al.  Efficient Information Extraction over Evolving Text Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[29]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[30]  Martin J. Wainwright,et al.  Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of l1-regularized MLE , 2008, NIPS.

[31]  Peter J. Haas,et al.  MCDB: a monte carlo approach to managing uncertain data , 2008, SIGMOD Conference.

[32]  Pedro M. Domingos,et al.  Markov Logic: An Interface Layer for Artificial Intelligence , 2009, Markov Logic: An Interface Layer for Artificial Intelligence.

[33]  Estevam R. Hruschka,et al.  Toward Never Ending Language Learning , 2009, AAAI Spring Symposium: Learning by Reading and Learning to Read.

[34]  Bo Zhang,et al.  StatSnowball: a statistical approach to extracting entity relationships , 2009, WWW '09.

[35]  Rada Chirkova,et al.  Materialized Views , 2012, Found. Trends Databases.

[36]  Gerhard Weikum,et al.  SOFIE: a self-organizing framework for information extraction , 2009, WWW '09.

[37]  Daniel Jurafsky,et al.  Distant supervision for relation extraction without labeled data , 2009, ACL.

[38]  Lise Getoor,et al.  PrDB: managing and exploiting rich correlations in probabilistic databases , 2009, The VLDB Journal.

[39]  Frederick Reiss,et al.  SystemT: a system for declarative information extraction , 2009, SGMD.

[40]  Gerhard Weikum,et al.  The YAGO-NAGA approach to knowledge discovery , 2009, SGMD.

[41]  Andrew McCallum,et al.  Scalable probabilistic databases with factor graphs and MCMC , 2010, Proc. VLDB Endow..

[42]  Jennifer Chu-Carroll,et al.  Building Watson: An Overview of the DeepQA Project , 2010, AI Mag..

[43]  Pedro M. Domingos,et al.  Efficient Belief Propagation for Utility Maximization and Repeated Inference , 2010, AAAI.

[44]  Andrew McCallum,et al.  Collective Cross-Document Relation Extraction Without Labelled Data , 2010, EMNLP.

[45]  Daniel S. Weld,et al.  Learning 5000 Relational Extractors , 2010, ACL.

[46]  Gerhard Weikum,et al.  From information to knowledge: harvesting entities and relationships from web sources , 2010, PODS '10.

[47]  Valentin I. Spitkovsky,et al.  A Simple Distant Supervision Approach for the TAC-KBP Slot Filling Task , 2010, TAC.

[48]  Andrew McCallum,et al.  Modeling Relations and Their Mentions without Labeled Text , 2010, ECML/PKDD.

[49]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[50]  Andrew McCallum,et al.  Query-Aware MCMC , 2011, NIPS.

[51]  Christopher Ré,et al.  Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS , 2011, Proc. VLDB Endow..

[52]  Valentin I. Spitkovsky,et al.  Stanford's Distantly-Supervised Slot-Filling System , 2011, TAC.

[53]  Frederick Reiss,et al.  SystemT: A Declarative Information Extraction System , 2011, ACL.

[54]  Christopher Ré,et al.  Incrementally Maintaining Classification using an RDBMS , 2011, Proc. VLDB Endow..

[55]  Gerhard Weikum,et al.  Scalable knowledge harvesting with high precision and high recall , 2011, WSDM '11.

[56]  Luke S. Zettlemoyer,et al.  Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations , 2011, ACL.

[57]  Alessandro Moschitti,et al.  End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories , 2011, ACL.

[58]  Christopher Ré,et al.  Big Data versus the Crowd: Looking for Relationships in All the Right Places , 2012, ACL.

[59]  F. Hollander Probability Theory : The Coupling Method , 2012 .

[60]  Min Wang,et al.  Optimizing Statistical Information Extraction Programs over Evolving Text , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[61]  Hans Uszkoreit,et al.  Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web , 2012, International Semantic Web Conference.

[62]  Nathanael Chambers,et al.  Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter , 2012, EACL.

[63]  Dejing Dou,et al.  Learning to Refine an Automatically Extracted Knowledge Base Using Markov Logic , 2012, 2012 IEEE 12th International Conference on Data Mining.

[64]  Stuart Adam Battersby,et al.  Experimenting with Distant Supervision for Emotion Classification , 2012, EACL.

[65]  Christopher Ré,et al.  Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference , 2012, Int. J. Semantic Web Inf. Syst..

[66]  V. Climenhaga Markov chains and mixing times , 2013 .

[67]  Christopher Ré,et al.  Towards high-throughput gibbs sampling at scale: a study across storage managers , 2013, SIGMOD '13.

[68]  J. William Murdock,et al.  Tools and Methods for Building Watson , 2013 .

[69]  Christopher Ré,et al.  Understanding Tables in Context Using Standard NLP Toolkits , 2013, ACL.

[70]  Ralph Grishman,et al.  Distant Supervision for Relation Extraction with an Incomplete Knowledge Base , 2013, NAACL.

[71]  Dan Suciu,et al.  The dichotomy of probabilistic inference for unions of conjunctive queries , 2012, JACM.

[72]  Zhifang Sui,et al.  Towards Accurate Distant Supervision for Relational Facts Extraction , 2013, ACL.

[73]  Denilson Barbosa,et al.  Shallow Information Extraction for the knowledge Web , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[74]  João Gama,et al.  A survey on concept drift adaptation , 2014, ACM Comput. Surv..

[75]  Christopher Ré,et al.  DimmWitted: A Study of Main-Memory Statistical Analytics , 2014, Proc. VLDB Endow..

[76]  Milos Nikolic,et al.  LINVIEW: incremental view maintenance for complex analytical queries , 2014, SIGMOD Conference.

[77]  Christopher D. Manning,et al.  Stanford's 2014 Slot Filling Systems , 2014 .

[78]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[79]  Distant Supervision for Relation Extraction with Matrix Completion , 2014, ACL.

[80]  C. Ré,et al.  A Machine Reading System for Assembling Synthetic Paleontological Databases , 2014, PloS one.

[81]  Wei Zhang,et al.  From Data Fusion to Knowledge Fusion , 2014, Proc. VLDB Endow..

[82]  Eduard H. Hovy,et al.  Weakly Supervised User Profile Extraction from Twitter , 2014, ACL.

[83]  Daisy Zhe Wang,et al.  Knowledge expansion over probabilistic knowledge bases , 2014, SIGMOD Conference.

[84]  Amir Sadeghian,et al.  Feature Engineering for Knowledge Base Construction , 2014, IEEE Data Eng. Bull..

[85]  Prasoon Goyal,et al.  Probabilistic Databases , 2009, Encyclopedia of Database Systems.

[86]  Christopher De Sa,et al.  DeepDive: Declarative Knowledge Base Construction , 2016, SGMD.