Toward General-Purpose Learning for Information Extraction

Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones; and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems which are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction which is designed for maximum generality and flexibility. SRV can exploit domain-specific information, including linguistic syntax and lexical information, in the form of features provided to the system explicitly as input for training. This process is illustrated using a domain created from Reuters corporate acquisitions articles. Features are derived from two general-purpose NLP systems, Sleator and Temperly's link grammar parser and Wordnet. Experiments compare the learner's performance with and without such linguistic information. Surprisingly, in many cases, the system performs as well without this information as with it.

[1]  Ellen Riloff,et al.  Automatically Acquiring Conceptual Patterns without an Annotated Corpus , 1995, VLC@ACL.

[2]  Wendy G. Lehnert,et al.  Wrap-Up: a Trainable Discourse Module for Information Extraction , 1994, J. Artif. Intell. Res..

[3]  Wendy G. Lehnert,et al.  Information extraction , 1996, CACM.

[4]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[5]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[6]  S. L. Golbeck,et al.  Representation And Learning , 1988 .

[7]  Franz Josef Radermacher,et al.  Modeling and artificial intelligence , 1991, Appl. Artif. Intell..

[8]  Douglas E. Appelt,et al.  FASTUS: A Finite-state Processor for Information Extraction from Real-world Text , 1993, IJCAI.

[9]  Dayne Freitag,et al.  Information Extraction from HTML: Application of a General Machine Learning Approach , 1998, AAAI/IAAI.

[10]  Ellen Riloff,et al.  Automatically Generating Extraction Patterns from Untagged Text , 1996, AAAI/IAAI, Vol. 2.

[11]  Maria T. Pazienza,et al.  Information Extraction , 2002, Lecture Notes in Computer Science.

[12]  Dayne Freitag,et al.  Using grammatical inference to improve precision in information extraction , 1997, ICML 1997.

[13]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[14]  Daniel Dominic Sleator,et al.  Parsing English with a Link Grammar , 1995, IWPT.

[15]  David D. Lewis,et al.  Representation and Learning in Information Retrieval , 1991 .

[16]  Stephen Glenn Soderland,et al.  Learning text analysis rules for domain-specific natural language processing , 1996 .