An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains

Abstract A primary goal of natural language processing researchers is to develop a knowledge-based natural language processing (NLP) system that is portable across domains. However, most knowledge-based NLP systems rely on a domain-specific dictionary of concepts, which represents a substantial knowledge-engineering bottleneck. We have developed a system called AutoSlog that addresses the knowledge-engineering bottleneck for a task called information extraction. AutoSlog automatically creates domain-specific dictionaries for information extraction, given an appropriate training corpus. We have used AutoSlog to create a dictionary of extraction patterns for terrorism, which achieved 98% of the performance of a hand-crafted dictionary that required approximately 1500 person-hours to build. In this paper, we describe experiments with AutoSlog in two additional domains: joint ventures and microelectronics. We compare the performance of AutoSlog across the three domains, discuss the lessons learned about the generality of this approach, and present results from two experiments that demonstrate that novice users can generate effective dictionaries using AutoSlog.
