Incremental Knowledge Base Construction Using DeepDive

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
