Data-Driven Speaker Adaptation using Articulatory Features

Real-world speech data usually contains several distinctly different speakers and speaking styles, so methods are needed that adapt ASR systems to an individual speaker and his or her speaking style(s). While phone-based approaches have long been used for speech recognition and speaker adaptation, this work presents an approach to adaptation using streams of “Articulatory Features” (AFs), which show great potential for adapting to different speaking styles. The approach exploits phonologically distinctive units to discriminate between speech sounds and is based on models of AF properties such as ROUNDED or VOICED. These properties can be detected robustly in speech and can improve discrimination between otherwise confusable words when full phone models have become mismatched, e.g. because a different speaking style is being used. This paper introduces an automatic procedure for training, on adaptation data, the free parameters introduced by the feature-stream combination, using the discriminative Maximum Mutual Information (MMI) criterion, and presents speaker-adaptation results on the English Spontaneous Scheduling Task (ESST) / Verbmobil phase II (VM-II). On this spontaneous speech task, with a baseline WER of 25.0%, state-independent AF speaker adaptation reduces the WER to 21.5%. State-dependent AF adaptation reaches 19.8% WER, while MLLR speaker adaptation with a comparable number of parameters reaches 20.9%. Using speaker-independent AF weights trained on the development test set (i.e. using AFs not for adaptation but to improve the general performance of the recognizer), the WER on the evaluation set is reduced by 1.8% absolute, while MLLR adaptation does not improve performance. These results and an initial analysis of the selected features show that the AF-based approach captures information that is not available to a purely phonetic approach.
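
To make the stream-combination idea concrete, the following Python fragment is a minimal sketch, not the paper's implementation: it combines a phone-model acoustic score with scores from binary AF detectors (e.g. VOICED, ROUNDED) in a weighted log-linear fashion. The stream weights are exactly the free parameters the abstract says are trained with the MMI criterion on adaptation data; all variable names and numerical values below are invented for illustration.

```python
# Minimal sketch of weighted log-linear stream combination (illustrative only).
# The phone stream and each articulatory-feature (AF) stream contribute a
# per-frame log-likelihood for an HMM state; the stream weights "lam" are the
# free parameters that would be estimated on adaptation data (in the paper,
# with the discriminative MMI criterion), either state-independently or
# state-dependently.  The numbers here are placeholders, not real scores.

phone_log_score = -4.2                       # log p(x | phone state), made up
af_log_scores = {"VOICED": -0.3,             # log p(x | VOICED model), made up
                 "ROUNDED": -1.1}            # log p(x | ROUNDED model), made up

# One weight for the phone stream plus one weight per AF stream.
lam = {"phone": 0.8, "VOICED": 0.15, "ROUNDED": 0.05}

def combined_log_score(phone_ll, af_lls, weights):
    """Weighted log-linear combination of phone and AF stream scores."""
    score = weights["phone"] * phone_ll
    for name, ll in af_lls.items():
        score += weights[name] * ll
    return score

print(combined_log_score(phone_log_score, af_log_scores, lam))
```

Setting all AF weights to zero recovers the baseline phone-only score, which is why the weights can be tuned per speaker (or left speaker-independent) without changing the underlying acoustic models.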
