Training GMMs for speaker verification

An established approach to training Gaussian Mixture Models (GMMs) for speaker verification is the expectation-maximisation (EM) algorithm. The EM algorithm is known to be sensitive to initialisation and prone to converging on local maxima. To explore these issues, three different initialisation methods are implemented, along with a split and merge technique to ‘pull’ the trained GMM out of a local maximum. Both approaches are shown to improve the likelihood of a GMM trained on speech data. Results of a verification task on the TIMIT and YOHO databases show that increased model fit does not translate directly into an improved equal error rate (EER). In no case does the split and merge procedure improve the EER. TIMIT results show peak performance of 4.8% EER at 20 EM iterations with random GMM initialisation. An EER of 1.41% is achieved on the YOHO database under the same regime. It is concluded that running EM to the optimal point of convergence achieves the best speaker verification performance, but that this optimal point depends on the data and model parameters.
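
The following is a minimal sketch, not the authors' implementation, of how EM-trained GMMs under different initialisations might be compared by model likelihood. It uses scikit-learn's GaussianMixture as a stand-in EM routine; its 'random' and 'kmeans' options only approximate the initialisation methods studied, the split and merge step is not included, and the synthetic feature matrix stands in for precomputed acoustic features (e.g. MFCCs).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for per-speaker acoustic features: 2000 frames of 12-dim vectors.
features = np.vstack([
    rng.normal(loc=mu, scale=1.0, size=(1000, 12))
    for mu in (-2.0, 2.0)
])

# Two initialisation strategies available in scikit-learn; other methods
# (e.g. binary splitting) would require custom code.
for init in ("random", "kmeans"):
    gmm = GaussianMixture(
        n_components=32,          # illustrative speaker-model order
        covariance_type="diag",   # diagonal covariances, common for speaker GMMs
        max_iter=20,              # the abstract's best TIMIT result used 20 EM iterations
        init_params=init,
        random_state=0,
    )
    gmm.fit(features)
    # Average per-frame log-likelihood: the "model fit" the abstract refers to.
    print(f"init={init:7s}  avg log-likelihood={gmm.score(features):.3f}")
```

In practice the same likelihood comparison would be made on held-out speech frames for each trained speaker model, since the abstract's point is that higher training likelihood does not necessarily lower the EER.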