Unsupervised Discovery of Domain-Specific Knowledge from Text

Learning by Reading (LbR) aims to enable machines to acquire knowledge from textual input and to reason about it. Such inference requires knowledge of the domain structure (such as entities, classes, and actions). We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4 million sentences, we learn about 250,000 simple propositions about American football in the form of predicate-argument structures such as "quarterbacks throw passes to receivers". Using several statistical measures, we show that our model generalizes and explains the data better than various baseline approaches, and that the improvement is statistically significant. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of end-to-end LbR systems.
