WSJCAM0: A BRITISH ENGLISH SPEECH CORPUS FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers, each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction and the utilities needed as support tools. All utterance transcriptions have been verified and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000-word bigram and 20,000-word trigram language models. The paper concludes with comparative results on these tasks for British and American English.
