The main problems with HMMs of sub-word units are the large amounts of training data and computer time needed to estimate the model parameters. In some applications it is not feasible to require that a new speaker utter many hundreds of words to train the system; hence the interest in quick adaptation based on a few tens of training utterances. Two bounds are given for comparison with the speaker-adaptation results, namely the recognition rates of speaker-dependent and cross-speaker recognition. Speaker-dependent recognition is achieved by training the HMMs with nearly 1000 words uttered by the same speaker used in the tests. Cross-speaker recognition, which gives a lower bound on performance, concerns experiments in which the models were trained by a speaker different from the one who uttered the test sentences. An adaptation algorithm using Parzen estimation and interpolation of the emission densities between the new- and old-speaker models was investigated. It gives satisfactory recognition rates while adapting the HMMs on the basis of only 40 training words uttered by the new speaker.
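The exact formulation used in the paper is not given here, but the core idea (a Parzen-window density built from the new speaker's few adaptation samples, linearly interpolated with the old speaker's emission density) can be sketched as follows. The function names, the Gaussian window, and the interpolation weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen (kernel) density estimate at scalar x, built from 1-D
    adaptation samples with a Gaussian window of width h (assumption:
    the paper does not necessarily use a Gaussian kernel)."""
    samples = np.asarray(samples, dtype=float)
    kernels = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    return kernels.mean()

def adapted_emission(x, old_density, new_samples, h, lam):
    """Interpolate the old speaker's emission density with the Parzen
    estimate from the new speaker's samples; lam in [0, 1] weights the
    new-speaker contribution (lam = 0 keeps the old model unchanged)."""
    return (1.0 - lam) * old_density(x) + lam * parzen_density(x, new_samples, h)
```

With only a few tens of adaptation words, the Parzen estimate alone would be unreliable; the interpolation lets the old-speaker model regularize it, which is the point of the scheme described in the abstract.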