Learning Domain-Specific Information Extraction Patterns from the Web

Many information extraction (IE) systems rely on manually annotated training data to learn patterns or rules for extracting information about events. Manual annotation is expensive, however, and a new data set must be annotated for each domain, so most IE training sets are relatively small. Consequently, IE patterns learned from annotated training sets often have limited coverage. In this paper, we explore the idea of using the Web to automatically identify domain-specific IE patterns that were not seen in the training data. We use IE patterns learned from the MUC-4 training set as anchors to identify domain-specific web pages and then learn new IE patterns from those pages. We compute the semantic affinity of each new pattern to automatically infer the type of information that it will extract. Experiments on the MUC-4 test set show that these new IE patterns improved recall with only a small loss of precision.
