Improved i-Vector Representation for Speaker Diarization

This paper proposes using a well-trained deep neural network (DNN) to enhance the i-vector representation used for speaker diarization. In effect, we replace the Gaussian mixture model typically used to train a universal background model (UBM) with a DNN that has been trained on a separate large-scale dataset. To train the T-matrix, instead of a traditional unsupervised UBM derived from a single feature set, we use a supervised UBM obtained from the DNN: filterbank input features are used to compute the posterior information, and MFCC features are then used to train the UBM. Next, we jointly use the DNN posteriors and MFCC features to calculate the zeroth- and first-order Baum–Welch statistics for training the extractor from which the i-vector is obtained. The system achieves a significant improvement over state-of-the-art approaches on the NIST 2008 speaker recognition evaluation telephone data task.
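The core of the approach is accumulating Baum–Welch statistics with frame posteriors taken from the DNN rather than from a GMM-UBM. A minimal sketch of that accumulation step is below; `baum_welch_stats` is a hypothetical helper, and the random posteriors stand in for real DNN senone posteriors computed from filterbank features.

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Accumulate zeroth- and first-order Baum-Welch statistics.

    posteriors: (T, C) frame-level class posteriors, here imagined as
                DNN senone posteriors (one row per frame, rows sum to 1)
    features:   (T, D) acoustic features, e.g. MFCCs
    Returns N of shape (C,) with N_c = sum_t gamma_t(c), and
            F of shape (C, D) with F_c = sum_t gamma_t(c) * x_t.
    """
    N = posteriors.sum(axis=0)      # zeroth-order statistics
    F = posteriors.T @ features     # first-order statistics
    return N, F

# Toy example: random stand-ins for DNN posteriors and MFCC frames
rng = np.random.default_rng(0)
T, C, D = 200, 8, 20                        # frames, classes, feature dim
logits = rng.normal(size=(T, C))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mfcc = rng.normal(size=(T, D))
N, F = baum_welch_stats(post, mfcc)
```

Because each posterior row sums to one, the zeroth-order statistics sum to the number of frames; the i-vector extractor (T-matrix) is then trained on these (N, F) pairs in the usual total-variability fashion.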
