In this paper we report on the first phase of speech corpus collection for the ESPRIT LTR project no. 30268, EuTrans. The corpus is intended to provide training material for speaker-independent continuous speech recognition and translation over the telephone line, based on a vocabulary of a few thousand words. Because of this application, the corpus is structured to contain speech material for acoustic modelling and textual material for language modelling and translation modelling. The speech material that is being collected, and that we describe in this paper, has been produced in a natural way. We describe the corpus with the aid of some statistical results that illustrate the characteristics of the acquired material. Finally, we present our future plans for collecting the other parts of the corpus and, in particular, introduce a new "dialogue oriented" collection paradigm.