Historically, tailoring language processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate "named entities" demonstrate that this approach can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic-NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.