Mixed-Initiative Development of Language Processing Systems

Historically, tailoring language processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to "bootstrapping" the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate "named entities" demonstrate that this approach can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic-NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.