Open Information Extraction from the Web

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
