DESIGN AND EVALUATION OF ACOUSTIC AND LANGUAGE MODELS FOR LARGE SCALE TELEPHONE

This paper describes the specification, design and development phases of two widely used telephone services based on automatic speech recognition. The effort spent on evaluating and tuning these services is discussed in detail. In developing the first service, mainly based on the recognition of "alphanumeric" sequences, a significant part of the work consisted of refining the acoustic models. To increase recognition accuracy, we adopted algorithms and methods previously consolidated on broadcast news transcription tasks. A significant result is that task-specific context-dependent phone models reduce the word error rate by about 40% relative to context-independent phone models. Note that this result was achieved on a small-vocabulary task, significantly different from those generally used in broadcast news transcription. We also investigated both unsupervised and supervised training procedures. Moreover, we studied a novel partly supervised technique that allows us to select in some
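As an illustration of how a relative word error rate (WER) reduction such as the one quoted in the abstract is usually computed, the following is a minimal sketch. It assumes the standard definition of WER as word-level edit distance divided by reference length. The function names and the example figures (10% and 6%) are hypothetical and are not taken from the paper, which does not give absolute error rates in this excerpt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the arithmetic.
    baseline = 0.10   # e.g. context-independent models
    improved = 0.06   # e.g. context-dependent models
    print(f"relative reduction: {relative_reduction(baseline, improved):.0%}")
```

Under these assumed figures, a drop from 10% to 6% WER is a 4-point absolute reduction but a 40% relative reduction, which is the kind of comparison the abstract reports.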
