1996 PRELIMINARY BROADCAST NEWS BENCHMARK TESTS

This paper documents the use of Broadcast News test materials in DARPA-sponsored Automatic Speech Recognition (ASR) Benchmark Tests conducted late in 1996. In this year's tests, the source materials were broadened to incorporate both television and radio news broadcasts. A form of "partitioned evaluation" (PE) testing was implemented for the first time. At three sites, an additional testing protocol, similar to that used in last year's "Dry Run" tests [1], was also used, now termed an "Unpartitioned Evaluation" (UE). Participants in these tests included nine groups at eight sites: BBN Systems and Technologies, Cambridge University (two groups), Carnegie Mellon University, IBM, LIMSI, New York University, Rutgers University, and SRI International. Word error rates are reported for the complete evaluation test set, drawn from 4 news broadcasts (2 radio and 2 TV), and for each "Focus Condition", corresponding to seven pre-defined subsets of similarly annotated data. For the system with the lowest measured word error rate, the word error rate for the complete test set was 27.1%, with error rates for the focus conditions ranging from 20.3% to 46.1%. The error rates for "found speech" vary dramatically throughout the course of a broadcast news segment, and from one segment to another, so that the test set word error rates tell only a portion of the story; each test set, and each subset, has its own properties. These factors are discussed at some length.

1. TRAINING AND TEST MATERIALS

The data used in this research program, and the source of the test materials, were collected by the staff of the Linguistic Data Consortium (LDC). The process of recording, digitizing, and transcribing this corpus is described in another paper in this Proceedings [2]. Approximately 50 hours of recorded radio and TV newscasts were made available for system training purposes. NIST distributed these data (on sets of 20 CD-ROMs), after receiving permission to do so from the LDC, to a community of researchers expressing tentative interest in participating in these tests. In addition to the eight sites that participated in the tests, four more sites received the development test materials but declined to participate in the 1996 Benchmark Tests.

Additional data (amounting to a total of 20 hours) were also provided by the LDC to NIST for potential use as development and evaluation test materials. NIST collaborated with the LDC and with representatives of the DoD to review and revise the annotation and transcription of these materials. NIST also selected and distributed both a development test set and an evaluation test set. These efforts are described in another paper in this Proceedings [3].

2. TEST PARADIGM AND SCORING

Nine different research groups, at eight sites, participated in these tests: BBN Systems and Technologies, Carnegie Mellon University, England's Cambridge University Engineering Department's "Connectionist" and "HTK" groups, IBM's T.J. Watson Laboratories, France's LIMSI group, a collaborative effort involving New York University and SRI International, Rutgers University, and SRI International. Three of these sites (BBN, CMU, and IBM) had also participated in last year's Hub 4 "Dry Run" Broadcast Materials benchmark tests. Discussions of the properties of the systems used for these tests are contained in other papers in this Proceedings.

The "Partitioned Evaluation" test paradigm meant that it was not necessary to develop and use a "segmenter" or "chopper" software module. For the "Unpartitioned Evaluation", as in last year's Hub 4 tests, such a module was required. The three sites that participated in both the 1995 and 1996 tests (BBN, CMU, and IBM) also provided UE test results, to complement and contrast with the PE system results.

Richard Stern served as chair of a Working Group that included representatives of potential test participants. This Working Group defined the test protocol that was implemented, as described in another paper in these Proceedings [4].

The scoring procedures for this year's evaluation followed last year's procedures with a few changes. As in last year's test, each ASR system output a "begin time" and "duration" for each recognized word. The ASR system's results were aligned and scored against time-marked "partitioned segments", using NIST's SCLITE scoring package. On average, the partitioned segments used in scoring were 54 words in length.
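As an informal illustration of this time-marked scoring input, the short Python sketch below shows one way that recognized words carrying a begin time and duration might be grouped into time-marked partitioned segments prior to alignment. The record layout and the midpoint-based assignment rule are assumptions made for the example; the actual alignment and scoring were performed by NIST's SCLITE package, whose internal handling may differ.

from dataclasses import dataclass
from typing import List

@dataclass
class HypWord:
    """A recognized word with the time marks each system was required to output."""
    word: str
    begin: float     # begin time, in seconds
    duration: float  # duration, in seconds

@dataclass
class Segment:
    """A time-marked partitioned segment of the reference transcript."""
    begin: float
    end: float
    ref_words: List[str]

def assign_to_segments(hyp_words, segments):
    """Group hypothesis words by the segment containing each word's midpoint
    (the midpoint rule is an assumption made for this sketch)."""
    grouped = [(seg, []) for seg in segments]
    for w in hyp_words:
        mid = w.begin + w.duration / 2.0
        for seg, bucket in grouped:
            if seg.begin <= mid < seg.end:
                bucket.append(w.word)
                break
    return grouped

# A toy example: two partitioned segments and four time-marked hypothesis words.
segments = [Segment(0.0, 2.5, ["good", "evening"]),
            Segment(2.5, 5.0, ["from", "washington"])]
hyps = [HypWord("good", 0.10, 0.30), HypWord("evening", 0.45, 0.50),
        HypWord("front", 2.60, 0.40), HypWord("washington", 3.05, 0.70)]
for seg, hyp_words in assign_to_segments(hyps, segments):
    print(seg.ref_words, "<-", hyp_words)

Each segment's hypothesis words would then be aligned against that segment's reference words for scoring.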
Before scoring, both the ASR system output and the reference transcripts were pre-filtered using orthographic transformation rules. The rules fall into four classes: (1) alternate standard spellings, (2) spelling errors in the training transcripts, (3) compound words, and (4) contractions. Rules for expansion of contractions were applied only to the hypothesis transcripts. See the discussion on "Orthographic Transformations" in another paper in these Proceedings [3].

New to this evaluation were the following.

1) Regions of overlapping speech were hand-marked in the reference transcripts and automatically ignored during the scoring process.

2) Contractions were scored against their correct expanded forms. This necessitated hand labeling the reference contractions to denote each contraction's correct expanded form, using context to disambiguate possible expansions.

3) Spoken word fragments in the reference transcript could match either nothing or a hypothesized word. Since the fragment notation contains only a best guess at the sequence of letters spoken, fragments were counted as correct if the fragment's text substring matched the beginning substring of the hypothesized word. For example, the reference fragment "fr-" would match "frank" but not "find". (A sketch of these matching conventions appears at the end of this section.)

Across the evaluation test set as a whole, the number of reference word tokens per speaker varies from 20 words to 1797 words. Note also that in some of the focus conditions there are particularly small samples; for example, the amount of Bob Dole's data categorized as "under degraded acoustic conditions" involves only 7 reference words. The total number of reference words in a focus condition is as low as 299 words, in the non-native broadcast speech focus condition. This attribute, the nonuniform representation of the data in the various focus conditions, is characteristic of these "found speech" data, and must be recognized when reviewing the results.

Table 2 presents a summary report for the systems participating in the Partitioned Evaluation Benchmark Tests. The numbers tabulated are those corresponding to the related test set (or subset). (Note that the word error rate shown for the ibm1 system in Table 1, and discussed in a previous paragraph, also appears in this table.) These are perhaps the most frequently cited "numbers" for these tests. Table 2(a) presents data for the complete test set and each of the focus conditions, and Table 2(b) presents data, in addition, for each of the test set's component broadcasts.
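To make the matching conventions described earlier in this section concrete, the Python sketch below applies contraction expansion to the hypothesis transcript only, treats a reference fragment as matching either nothing or any hypothesized word that begins with the fragment's letters, and computes a word error rate from a standard edit-distance word alignment. The contraction table, the fragment-aware deletion cost, and the choice of denominator are assumptions made for the example; this is not the SCLITE implementation, and the exclusion of overlapping-speech regions is omitted.

CONTRACTIONS = {"it's": ["it", "is"], "don't": ["do", "not"]}  # tiny illustrative table

def expand_contractions(words):
    """Contraction expansion, applied to hypothesis transcripts only."""
    out = []
    for w in words:
        out.extend(CONTRACTIONS.get(w.lower(), [w.lower()]))
    return out

def is_fragment(ref_word):
    """Reference word fragments are notated here with a trailing hyphen, e.g. 'fr-'."""
    return ref_word.endswith("-")

def words_match(ref, hyp):
    """A fragment matches any hypothesis word beginning with its letters
    ('fr-' matches 'frank' but not 'find'); other words must match exactly."""
    ref, hyp = ref.lower(), hyp.lower()
    if is_fragment(ref):
        return hyp.startswith(ref[:-1])
    return ref == hyp

def word_error_rate(ref, hyp):
    """Edit-distance word alignment using words_match as the equality test.
    Deleting a fragment costs nothing, reflecting the rule that a fragment
    may also match no hypothesis word at all."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + (0 if is_fragment(ref[i - 1]) else 1)
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            del_cost = 0 if is_fragment(ref[i - 1]) else 1
            sub_cost = 0 if words_match(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + del_cost,        # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + sub_cost)    # substitution or match
    return d[n][m] / max(n, 1)

reference = ["fr-", "it", "is", "a", "test"]                  # reference containing a fragment
hypothesis = expand_contractions(["it's", "a", "test"])       # contraction expanded before alignment
print(round(100 * word_error_rate(reference, hypothesis), 1), "% WER")

In this example the fragment "fr-" matches nothing without penalty and the expanded contraction aligns exactly, so the sketch reports 0.0% word error.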