Abstract

The Linguistic Data Consortium (LDC) began its first Broadcast News (BN) speech collection in the spring of 1996, facing a host of challenges including IPR negotiations with broadcasters, the establishment of new transcription conventions and tools, and a compressed schedule for the creation and release of speech, transcripts, and in-domain language model data. The amount of acoustic training data available to participants in the DARPA Hub4 English benchmark tests doubled from 50 h in 1996 to 100 h in 1997, and doubled again to 200 h in 1998. An additional 40 h was made available in the summer of 1999. The 1997 benchmark test also saw the addition of BN speech and transcripts in Spanish and Mandarin Chinese, though in smaller quantities, with 30 h of training data in each language. Supplements to the existing pronunciation lexicons in each language were also produced. More recently, the coordinated research project on topic detection and tracking (TDT) has called for a large collection of BN speech data, totaling about 1100 h in English and 300 h in Mandarin over two phases (TDT2 and TDT3), although the level of detail and quality of the TDT transcriptions is not comparable to that of the Hub4 collections.