Paper information - The design of a large vocabulary speech corpus for Portuguese

The design of a large vocabulary speech corpus for Portuguese

Recent years have seen great progress in large vocabulary, speaker-independent continuous speech recognition systems, along with some research into multilingual aspects. To extend that progress to European Portuguese, we decided to develop and collect a large database of continuous speech based on a large amount of text. In developing this new Portuguese database, our aim was to create a corpus equivalent in size to WSJ0. We selected the database texts from the PÚBLICO newspaper, which is characterized by broad coverage of subject matter and different writing styles. The recording population was selected from a large engineering school, ensuring a large variability of speakers. The recordings are being made as we write this paper, and we expect to release the database in CD format in September 1997.

Ciro Martins | João Paulo da Silva Neto | Luís B. Almeida | Hugo Meinedo

[1] Ciro Martins, et al. The development of a speaker independent continuous speech recognizer for Portuguese, 1997, EUROSPEECH.

[2] J. Foote, et al. WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, 1995.

[3] Steve Renals, et al. WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[4] Maxine Eskénazi, et al. BREF, a large vocabulary spoken corpus for French, 1991, EUROSPEECH.

[5] Lori Lamel, et al. Issues in Large Vocabulary, Multilingual Speech Recognition, 1995, EUROSPEECH.

[6] Ciro Martins, et al. Speaker-adaptation in a hybrid HMM-MLP recognizer, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[7] Janet M. Baker, et al. The Design for the Wall Street Journal-based CSR Corpus, 1992, HLT.
