Paper Information - The Design for the Wall Street Journal-based CSR Corpus

The Design for the Wall Street Journal-based CSR Corpus

The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA with its first general-purpose English, large-vocabulary, natural-language, high-perplexity corpus containing significant quantities of both speech data (400 hours) and text data (47M words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.

Janet M. Baker | Douglas B. Paul

[1] David B. Pisoni, et al. Text-to-speech: the MITalk system, 1987.

[2] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[3] Patti Price, et al. The DARPA 1000-word resource management database for continuous speech recognition, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[4] Mark Liberman, et al. Text on Tap: the ACL/DCI, 1989, HLT.

[5] Janet M. Baker. Dragondictate (TM)-30k: natural language speech recognition with 30,000 words, 1989, EUROSPEECH.

[6] Louis C. W. Pols. Proceedings of ESCA Tutorial Day and Workshop on Speech Input/Output Assessment and Speech Databases, Noordwijkerhout, The Netherlands, 20-23 September 1989, 1989.

[7] Janet M. Baker, et al. On the Interaction Between True Source, Training, and Testing Language Models, 1990, HLT.

[8] Maxine Eskénazi, et al. Design considerations and text selection for BREF, a large French read-speech corpus, 1990, ICSLP.

[9] Hsiao-Wuen Hon, et al. On vocabulary-independent speech modeling, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[10] Douglas B. Paul. Experience with a Stack Decoder-Based HMM CSR and Back-Off N-Gram Language Models, 1991, HLT.