TEG: a hybrid approach to information extraction

This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (Stochastic Context Free Grammar) based extraction language, and training them using an annotated corpus. The system does not contain any purely linguistic components, such as PoS tagger or parser. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amount of training data. The improvement in accuracy is slight for named entity extraction task and more pronounced for relation extraction.

[1]  Hwee Tou Ng,et al.  Named Entity Recognition: A Maximum Entropy Approach Using Global Information , 2002, COLING.

[2]  Steffen Lange,et al.  A Unifying Approach to HTML Wrapper Representation and Learning , 2000, Discovery Science.

[3]  Alexander A. Morgan,et al.  Background and overview for KDD Cup 2002 task 1: information extraction from biomedical articles , 2002, SKDD.

[4]  Walter Daelemans,et al.  Information Extraction via Double Classification , 2003 .

[5]  Brian Roark,et al.  Efficient probabilistic top-down and left-corner parsing , 1999, ACL.

[6]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[7]  Nicholas Kushmerick,et al.  Finite-State Approaches to Web Information Extraction , 2002, SCIE.

[8]  Richard M. Schwartz,et al.  An Algorithm that Learns What's in a Name , 1999, Machine Learning.

[9]  Michael Collins,et al.  Semantic Tagging using a Probabilistic Context Free Grammar , 1998, VLC@COLING/ACL.

[10]  Richard M. Schwartz,et al.  BBN: Description of the SIFT System as Used for MUC-7 , 1998, MUC.

[11]  Christopher D. Manning,et al.  An O(n^3) Agenda-Based Chart Parser for Arbitrary Probabilistic Context-Free Grammars , 2001 .

[12]  Michael Collins,et al.  Three Generative, Lexicalised Models for Statistical Parsing , 1997, ACL.

[13]  Wai Lam,et al.  Using Support Vector Machines for Terrorism Information Extraction , 2003, ISI.

[14]  Dayne Freitag,et al.  Information Extraction from HTML: Application of a General Machine Learning Approach , 1998, AAAI/IAAI.

[15]  Lynette Hirschman,et al.  Evaluating Message Understanding Systems: An Analysis of the Third Message Understanding Conference (MUC-3) , 1993, CL.

[16]  James S. Aitken Learning Information Extraction Rules: An Inductive Logic Programming approach , 2002, ECAI.

[17]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[18]  Andrew McCallum,et al.  Information Extraction with HMMs and Shrinkage , 1999 .

[19]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[20]  Andrew McCallum,et al.  Information Extraction with HMM Structures Learned by Stochastic Optimization , 2000, AAAI/IAAI.

[21]  Yonatan Aumann,et al.  A Comparative Study of Information Extraction Strategies , 2002, CICLing.

[22]  Richard M. Schwartz,et al.  Nymble: a High-Performance Learning Name-finder , 1997, ANLP.

[23]  Kazem Taghva,et al.  Address extraction using hidden Markov models , 2005, IS&T/SPIE Electronic Imaging.