NAMED ENTITY EXTRACTION FROM SPEECH: APPROACH AND RESULTS USING THE TEXTPRO SYSTEM

This paper describes the application of the TextPro system to the task of recognizing named entities in speech. TextPro is a lightweight engine for interpreting cascaded finite-state transducers. Although originally intended for processing text, the experience of this evaluation demonstrates that the system can easily be adapted to processing transcripts generated by a speech recognizer as well.

1. THE TEXTPRO EXTRACTION SYSTEM

For its participation in the Hub4 named-entity identification task, SRI International employed a newly developed information extraction system called TextPro. TextPro is a lightweight interpreter of cascaded finite-state transducers based on the TIPSTER Document Manager architecture [Grishman et al., 1996] and the TIPSTER Common Pattern Specification Language [1] (CPSL). TextPro's finite-state transducers accept and produce sequences of annotations on the document, conforming to the structure specified by the TIPSTER Document Manager architecture; the transducers themselves are expressed as finite-state rules written in CPSL. The grammars employed by the Hub4 name recognizer specified the creation of ENAMEX, NUMEX, and TIMEX annotations, as well as other annotations used internally by the system. After each of the cascaded transducers had been run over an input text, a postprocessor inserted SGML markup as required by the rules of the named-entity task.

TextPro was originally developed to process text documents and to test alternative specifications for CPSL; the first author served on the design committee for CPSL under the TIPSTER program. The program runs on PowerPC Macintosh computers and is freely downloadable from the World Wide Web [2]. Although originally developed for these limited objectives, experience led us to conclude that TextPro was a very useful system for performing document annotation tasks that do not involve the construction and merging of template structures, such as those typical of the MUC scenario template tasks [MUC-6, 1995]. For this reason, and because it is small and extremely fast, we felt that TextPro was a superior alternative to the better-known FASTUS system [Hobbs et al., 1996], which SRI has employed in various MUC evaluations.

[1] Because of the premature end of the TIPSTER program, the specifications for the Common Pattern Specification Language were never finalized or published. Further information is obtainable from the authors.
[2] The URL for obtaining TextPro is http://www.ai.sri.com/~appelt/TextPro/.

1.1. Adapting TextPro to the Hub4 task

Although TextPro was originally intended to process newspaper texts, it proved very straightforward to process speech transcriptions in Universal Transcription Format, whether human or machine generated. The adaptation process began by translating the FASTUS grammar used for SRI International's participation in MUC-6 into CPSL. The MUC-6 grammar provided a high-performance baseline to start from: the SRI MUC-6 FASTUS system performed well on the named-entity task, achieving an F-measure of 94. The MUC-6 name recognizer, however, was optimized for mixed-case texts and typical Wall Street Journal articles, and its performance on the Hub4 task was therefore considerably short of optimal. Adapting the grammar to work well with monocase texts, absent the information ordinarily provided by capitalization, required the use of large lexicons to indicate which words were likely to be parts of names. The TextPro Hub4 system uses four large lexicons in addition to the lexicons used by the MUC-6 system:

1. A large lexicon of United States place names that was originally distributed with the place-name gazetteer for MUC-5, supplemented by a manually culled set of foreign place names from the same source.

2. A proprietary list of person names of many nationalities obtained from Nuance Communications Corp. [3]

3. A list of prominent American and multinational corporations. This lexicon was the same one used by SRI for MUC-5 and MUC-6 participation, with the addition of some recently founded corporations.

4. A list of United States government agencies and departments. The lexicon used in this evaluation was expanded considerably over previous versions by using names appearing in the Hub4 training data.

[3] This proprietary list cannot be given out in a public distribution. It is possible to replace this list with a list of American first and last names culled from census data publicly available on the Web. However, because of the census data's weaker coverage of foreign names, its performance on the Hub4 task is noticeably lower.

After the initial grammars and large lexicons were in place, the next step was iterative testing and debugging to raise the level of the system's performance by refining the rules and lexical entries. Given the high speed of the TextPro system, it was very easy to do runs over the entire set of training data: ten megabytes of training data could be processed in about two hours on our available hardware. The process of hill climbing on training data was not much different for the Hub4 task than for other information extraction tasks in which SRI has participated. A few innovations were necessary to process speech transcription data successfully:

· Important discourse contexts, in particular sports report and weather report contexts, were recognized. Sports report contexts were recognized so that names referring to sports teams that are ambiguous with ordinary English words (e.g., "Indians") would be properly identified. In weather report contexts, it was important to recognize that phrases like "in the sixties" are not temporal expressions.

· Rules were needed to decompose lists of person-name words into likely combinations of first, middle, and last names.
This was very important for lists or conjunctions of person names, and for the frequent situations in which names appeared adjacent to a sentence boundary that would not be marked in the speech transcript.

· Successful name recognition requires recognizing subsequent references to the same person, particularly when such references involve only the first or last name of a previously mentioned person and would not otherwise be tagged as names because of ambiguity with ordinary English words. This strategy works very well, except in the relatively frequent situations in which the speaker utters a fragment or a repair. If such a fragment is recognized as a common one-syllable word and incorrectly treated as a name, it can cause the erroneous tagging of many words in a text. The TextPro Hub4 system therefore used frequency data gleaned from the Penn Treebank Wall Street Journal corpus to limit the recognition of very common words as name parts to those contexts in which their status as names was unambiguous.

1.2. Evaluation Results

The table in Figure 1 shows the results obtained by the TextPro Hub4 system in the recent evaluation, both for TextPro applied to the reference transcripts and for TextPro applied to the output of SRI's own speech recognition system. For the reference transcripts and the baseline recognizer output, the TextPro results are very close to the best reported in each category. These evaluation results are quite consistent with the results obtained by SRI during our development testing.
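The scores reported here are F-measures, the balanced combination of precision and recall used in the MUC and Hub4 named-entity evaluations. As a reminder of the metric (a standard sketch of the balanced F-measure, not the official Hub4 scorer, which computes separate scores for content, extent, and type):

```python
def f_measure(correct, guessed, actual):
    """Balanced F-measure: harmonic mean of precision and recall.

    correct -- number of correctly tagged entities
    guessed -- number of entities the system proposed
    actual  -- number of entities in the answer key
    """
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, a system that proposes 100 entities, 90 of them correct, against an answer key of 100 entities scores F = 0.90.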
For development and testing, we divided the available 10 megabytes of training data furnished by Mitre and BBN into an eight-megabyte training corpus and a two-megabyte test corpus, which was kept blind. In a final run before the official test, we recorded an average F-measure of 92 for the training data, and 89 for the blind development test data.

Evaluation Task                   Content  Extent  Type  Average
Reference transcript, Segment 1     0.93    0.87   0.90    0.90
Reference transcript, Segment 2     0.93    0.88   0.91    0.91
SRI recognizer, Segment 1           0.76    0.75   0.79    0.77
SRI recognizer, Segment 2           0.80    0.76   0.81    0.79
SRI <10X Real Time Recognizer       0.76    0.74   0.78    0.76

Figure 1: SRI International's TextPro tagging results (F-measures) on the Hub4 named-entity recognition task
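The frequency-based filtering described in Section 1.1, which keeps very common words from being tagged as name parts outside unambiguous contexts, can be sketched as follows. The lexicon entries, common-word list, and disambiguating contexts below are illustrative assumptions for the sketch, not the actual TextPro lexicons or Penn Treebank frequency data:

```python
# Sketch: restrict lexicon-based tagging of very common words to
# contexts where their status as a name part is unambiguous.
PERSON_LEXICON = {"smith", "may", "john", "baker"}       # hypothetical entries
COMMON_WORDS = {"may", "baker"}                          # high corpus frequency (illustrative)
NAME_CONTEXTS = {"mr.", "mrs.", "senator", "president"}  # titles that disambiguate

def tag_person_parts(tokens):
    """Return indices of tokens tagged as person-name parts."""
    tagged = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word not in PERSON_LEXICON:
            continue
        if word in COMMON_WORDS:
            # A common word counts as a name part only when preceded by a
            # disambiguating title or by a token already tagged as a name.
            prev = tokens[i - 1].lower() if i > 0 else ""
            if prev not in NAME_CONTEXTS and (i - 1) not in tagged:
                continue
        tagged.append(i)
    return tagged
```

On the monocase input "senator may met john baker today", this sketch tags "may" (licensed by the preceding title), "john" (an uncommon lexicon entry), and "baker" (licensed by the preceding name part), while leaving "may" untagged in "it may rain".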