An established approach to training Gaussian mixture models (GMMs) for speaker verification is the expectation-maximisation (EM) algorithm. EM is known to be sensitive to initialisation and prone to converging on local maxima. To explore these issues, three initialisation methods are implemented, along with a split-and-merge technique intended to ‘pull’ the trained GMM out of a local maximum. Both approaches are shown to improve the likelihood of a GMM trained on speech data. Results of a verification task on the TIMIT and YOHO databases show, however, that improved model fit does not translate directly into a lower equal error rate (EER). In no case does the split-and-merge procedure improve the EER. On TIMIT, performance peaks at 4.8% EER with 20 EM iterations and random GMM initialisation. An EER of 1.41% is achieved on the YOHO database under the same regime. It is concluded that running EM to the optimal point of convergence gives the best speaker verification performance, but that this optimal point depends on the data and model parameters.
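Because the results above are reported as EER, it may help to see how that figure can be obtained from a set of verification scores. The following is a minimal sketch, not the authors' evaluation code: the function name `equal_error_rate`, the convention that a trial is accepted when its score is at or above the threshold, and the toy scores are all assumptions made for illustration. It sweeps every observed score as a candidate threshold and reports the point where the false-accept and false-reject rates are closest.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) at the point where the false-accept
    rate (FAR) and false-reject rate (FRR) are closest.

    Assumption: higher scores mean "more likely the claimed speaker",
    and a trial is accepted when score >= threshold.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # Impostor trials wrongly accepted at this threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # Genuine (target) trials wrongly rejected at this threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, e.g. log-likelihood ratios from a GMM system.
target = [2.0, 1.5, 1.0, 0.2]
impostor = [0.5, 0.1, -0.3, -1.0]
eer, threshold = equal_error_rate(target, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch takes the threshold with the smallest gap and averages the two rates there. Evaluation toolkits often interpolate between neighbouring thresholds instead.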