Previous work on automated dictionary construction for information extraction has relied on annotated text corpora. However, annotating a corpus is time-consuming and difficult. We propose that conceptual patterns for information extraction can be acquired automatically using only a preclassified training corpus and no text annotations. We describe a system called AutoSlog-TS, which is a variation of our previous AutoSlog system, that runs exhaustively on an untagged text corpus. Text classification experiments in the MUC-4 terrorism domain show that the AutoSlog-TS dictionary performs comparably to a hand-crafted dictionary, and actually achieves higher precision on one test set. For text classification, AutoSlog-TS requires no manual effort beyond the preclassified training corpus. Additional experiments suggest how a dictionary produced by AutoSlog-TS can be filtered automatically for information extraction tasks. Some manual intervention is still required in this case, but AutoSlog-TS significantly reduces the amount of effort required to create an appropriate training corpus.

1 Introduction

In the last few years, significant progress has been made toward automatically acquiring conceptual patterns for information extraction (e.g., [Riloff, 1993; Kim and Moldovan, 1993]). However, previous approaches require an annotated training corpus or some other type of manually encoded training data. Annotated training corpora are expensive to build, both in terms of the time and the expertise required to create them. Furthermore, training corpora for information extraction are typically annotated with domain-specific tags, in contrast to general-purpose annotations such as part-of-speech tags or noun-phrase bracketing (e.g., the Brown Corpus [Francis and Kucera, 1982] and the Penn Treebank [Marcus et al., 1993]). Consequently, a new training corpus must be annotated for each domain.
We have begun to explore the possibility of using an untagged corpus to automatically acquire conceptual patterns for information extraction. Our approach uses a combination of domain-independent linguistic rules and statistics. The linguistic rules are based on our previous system, AutoSlog [Riloff, 1993], which automatically constructs dictionaries for information extraction using an annotated training corpus. We have put a new spin on the original system by applying it exhaustively to an untagged but preclassified training corpus (i.e., a corpus in which the texts have been manually classified as either relevant or irrelevant). Statistics are then used to sift through the myriad of patterns that it produces. The new system, AutoSlog-TS, can generate a conceptual dictionary of extraction patterns for a domain from a preclassified text corpus.
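The statistical sifting step can be illustrated with a minimal sketch. This section does not say which statistics are used, so the scoring below is an assumption chosen for illustration: each pattern gets a relevance rate (the fraction of its occurrences that fall in relevant texts), weighted by the base-2 logarithm of its total frequency. The function name `rank_patterns`, the input layout, and the pattern strings are hypothetical, and the pattern generator itself is taken as given.

```python
import math
from collections import Counter


def rank_patterns(docs):
    """Rank extraction patterns found in a preclassified corpus.

    `docs` is an iterable of (patterns, is_relevant) pairs, one per text,
    where `patterns` lists every pattern occurrence found in that text.
    Returns (pattern, relevance_rate, score) tuples, best score first.
    """
    total = Counter()     # occurrences across all texts
    relevant = Counter()  # occurrences in relevant texts only
    for patterns, is_relevant in docs:
        for p in patterns:
            total[p] += 1
            if is_relevant:
                relevant[p] += 1

    scored = []
    for p, n in total.items():
        rate = relevant[p] / n        # estimated Pr(relevant text | pattern)
        score = rate * math.log2(n)   # assumed weighting that favors frequent patterns
        scored.append((p, rate, score))
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Hypothetical toy corpus: three relevant texts and three irrelevant ones.
corpus = [
    (["<subj> was bombed", "<subj> said"], True),
    (["<subj> was bombed", "<subj> said"], True),
    (["<subj> was bombed", "<subj> said"], True),
    (["<subj> said"], False),
    (["<subj> said"], False),
    (["<subj> said"], False),
]
ranking = rank_patterns(corpus)
for pattern, rate, score in ranking:
    print(f"{pattern}: relevance rate {rate:.2f}, score {score:.2f}")
```

In this toy corpus the domain-specific pattern occurs only in relevant texts and ranks first, while the generic pattern is spread evenly across both classes and ranks lower even though it is more frequent. Only the relevant/irrelevant label of each text is consulted, which is the sense in which no text annotations are needed.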
[1] Richard Granger, et al. FOUL-UP: A Program that Figures Out Meanings of Words from Context. 1977, IJCAI.
[2] Jaime G. Carbonell, et al. Towards a Self-Extending Parser. 1979, ACL.
[3] R. Burchfield. Frequency Analysis of English Usage: Lexicon and Grammar. By W. Nelson Francis and Henry Kučera with the assistance of Andrew W. Mackie. Boston: Houghton Mifflin. 1982. x + 561. 1985.
[4] Paul S. Jacobs, et al. Acquiring Lexical Knowledge from Text: A Case Study. 1988, AAAI.
[5] Wendy G. Lehnert, et al. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. 1988.
[6] Bruce W. Ballard, et al. Proceedings of the second conference on Applied natural language processing. 1988.
[7] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. 1988, ANLP.
[8] Claire Cardie, et al. University of Massachusetts: Description of the CIRCUS System as Used for MUC-4. 1992, MUC.
[9] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank. 1993, CL.
[10] Richard M. Schwartz, et al. Coping with Ambiguity and Unknown Words through Probabilistic Models. 1993, CL.
[11] Ellen Riloff, et al. Automatically Constructing a Dictionary for Information Extraction Tasks. 1993, AAAI.
[12] Dan I. Moldovan, et al. Acquisition of semantic patterns for information extraction from corpora. 1993, Proceedings of 9th IEEE Conference on Artificial Intelligence for Applications.
[13] Ellen Riloff. Information extraction as a basis for portable text classification systems. 1994.
[14] Ellen Riloff, et al. Information extraction as a basis for high-precision text classification. 1994, TOIS.