Paper information - Utterance partitioning for speaker recognition: an experimental review and analysis with new findings under GMM-SVM framework

The performance of a speaker recognition system depends heavily on the duration of the speech used in enrollment and test. This work presents a detailed experimental review and analysis of the GMM-SVM based speaker recognition system in the presence of duration variability. It also compares the performance of the GMM-SVM classifier with that of its precursor, the Gaussian mixture model-universal background model (GMM-UBM) classifier, in the presence of duration variability. The goal of this work is not to propose a new algorithm for improving speaker recognition performance under duration variability. Rather, the main focus is on utterance partitioning (UP), a commonly used strategy to compensate for duration variability. We analyse in detail the impact of training utterance partitioning on speaker recognition performance under the GMM-SVM framework. We further investigate why utterance partitioning is important for boosting speaker recognition performance, and we show in which cases it is useful and in which it is not. Our study reveals that, contrary to the claim of an earlier study, utterance partitioning does not reduce the data imbalance problem of the GMM-SVM classifier. In addition, we discuss the impact of parameters such as the number of Gaussians, the supervector length, and the amount of splitting required to obtain better performance in short- and long-duration test conditions, from a speech-duration perspective. The experiments were performed with telephone speech from the POLYCOST corpus consisting of 130 speakers.
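The basic idea of utterance partitioning can be illustrated with a minimal sketch, assuming that partitioning means splitting the sequence of feature frames of one training utterance into contiguous, near-equal segments, each of which is then treated as a separate (shorter) training utterance. The function name `partition_utterance`, the parameter `num_parts`, and the toy frame data are hypothetical and chosen only for illustration; the abstract does not specify the exact partitioning procedure used in the paper, and the GMM-SVM steps that would follow (supervector extraction, SVM training) are not shown.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def partition_utterance(frames: Sequence[Frame], num_parts: int) -> List[List[Frame]]:
    """Split a sequence of feature frames into contiguous, near-equal segments.

    Illustrative assumption only: this is not the paper's exact procedure.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if num_parts > len(frames):
        raise ValueError("cannot create more segments than there are frames")

    base, extra = divmod(len(frames), num_parts)
    segments: List[List[Frame]] = []
    start = 0
    for i in range(num_parts):
        # The first `extra` segments receive one additional frame each,
        # so that every frame is used exactly once.
        end = start + base + (1 if i < extra else 0)
        segments.append(list(frames[start:end]))
        start = end
    return segments


# Toy "utterance": 10 frames, each a 2-dimensional feature vector (made-up values).
utterance = [[float(t), float(t) * 0.5] for t in range(10)]
segments = partition_utterance(utterance, 4)
segment_lengths = [len(s) for s in segments]
```

Under this assumption, one long training utterance yields `num_parts` shorter sub-utterances, so the durations seen in training move closer to those of short test utterances. Each sub-utterance would then produce its own supervector for SVM training.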
