The design of a large vocabulary speech corpus for Portuguese

Recent years have seen substantial progress in large-vocabulary, speaker-independent continuous speech recognition systems, along with some research into multilingual aspects. To extend that progress to European Portuguese, we decided to design and collect a large database of continuous speech based on a large amount of text. Our aim in developing this new Portuguese database was to create a corpus equivalent in size to WSJ0. We selected the database texts from the PÚBLICO newspaper, which is characterized by broad topic coverage and a range of different writing styles. The speakers were recruited from a large engineering school, which ensures wide variability among speakers. Recording is under way at the time of writing, and we expect to release the database on CD in September 1997.
