Analysis of a probabilistic model of redundancy in unsupervised information extraction

Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems offer an approach to a primary challenge in artificial intelligence: the automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy affect the probability of correctness? We present a combinatorial "balls-and-urns" model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that, for UIE, the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
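To make the role of redundancy concrete, the following is a minimal sketch of a simplified single-urn case, not the authors' implementation. It assumes every correct label is drawn with the same per-draw probability and every erroneous label with another, lower one, so the posterior that a label seen k times in n draws is correct reduces to a ratio of two binomial-style terms. The function name, parameter names, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def urns_uniform_posterior(k, n, num_correct, num_error, p_correct, p_error):
    """Posterior that a label seen k times in n draws is correct.

    Simplifying assumption (not from the abstract): all correct labels share
    one per-draw probability p_correct, and all error labels share p_error.
    The binomial coefficient C(n, k) cancels between numerator and
    denominator, so it is omitted.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not (0.0 < p_correct < 1.0 and 0.0 < p_error < 1.0):
        raise ValueError("draw probabilities must lie strictly between 0 and 1")
    if num_correct <= 0 or num_error <= 0:
        raise ValueError("label counts must be positive")

    def log_weight(count, p):
        # log( count * p^k * (1 - p)^(n - k) ), computed in log space
        # to avoid underflow when n is large.
        return math.log(count) + k * math.log(p) + (n - k) * math.log1p(-p)

    log_c = log_weight(num_correct, p_correct)
    log_e = log_weight(num_error, p_error)

    # Normalize by the larger term before exponentiating.
    m = max(log_c, log_e)
    weight_c = math.exp(log_c - m)
    weight_e = math.exp(log_e - m)
    return weight_c / (weight_c + weight_e)


if __name__ == "__main__":
    # Hypothetical setting: 100 correct labels, 1000 error labels,
    # correct labels ten times more likely per draw than errors.
    for k in range(1, 6):
        prob = urns_uniform_posterior(k, 1000, 100, 1000, 0.001, 0.0001)
        print(k, round(prob, 4))
```

Under these assumptions the posterior rises with the repetition count k whenever correct labels are drawn more often than errors, which is the qualitative effect of redundancy described above. The sketch covers neither non-uniform repetition rates nor corroboration from multiple extraction rules.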
