Paper information - Modelling speaker and channel variability using deep neural networks for robust speaker verification

We propose to improve the performance of i-vector based speaker verification by processing the i-vectors with a deep neural network before they are fed to a cosine distance or probabilistic linear discriminant analysis (PLDA) classifier. To this end, we build on an existing model that we refer to as Non-linear Within Class Normalization (NWCN) and introduce a novel Speaker Classifier Network (SCN). Both models deliver impressive speaker verification performance, showing relative improvements of 56% and 68%, respectively, over standard i-vectors when combined with a cosine distance backend. The NWCN model also reduces the equal error rate for PLDA from 1.78% to 1.63%. We also test these models under domain mismatch, i.e., when no in-domain training data is available. Under these conditions, SCN features combined with cosine distance perform better than the PLDA baseline, achieving an equal error rate of 2.92% compared with 3.37%.
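The abstract reports its results as cosine-distance scores and equal error rates (EER). The following is a minimal sketch of those quantities, not the authors' implementation. The function names, the toy score lists, and the simple threshold sweep used for the EER are all assumptions made for illustration.

```python
import math


def cosine_score(w_enroll, w_test):
    """Cosine similarity between two i-vectors (higher suggests same speaker)."""
    dot = sum(a * b for a, b in zip(w_enroll, w_test))
    norm_enroll = math.sqrt(sum(a * a for a in w_enroll))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_enroll * norm_test)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted when its score is >= the threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, in percent of the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical toy scores, not data from the paper.
targets = [0.9, 0.8, 0.3]
nontargets = [0.1, 0.2, 0.4]
print(equal_error_rate(targets, nontargets))

# EER figures quoted in the abstract.
print(round(relative_reduction(1.78, 1.63), 1))  # PLDA baseline vs. NWCN + PLDA
print(round(relative_reduction(3.37, 2.92), 1))  # PLDA baseline vs. SCN + cosine
```

Applied to the figures in the abstract, the last two lines put the PLDA gain from NWCN at roughly 8% relative and the domain-mismatch gain of SCN with cosine distance over the PLDA baseline at roughly 13% relative.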

Patrick Kenny | Gautam Bhattacharya | Vishwa Gupta | Md. Jahangir Alam
