Modelling speaker and channel variability using deep neural networks for robust speaker verification

We propose to improve the performance of i-vector based speaker verification by processing the i-vectors with a deep neural network before they are fed to a cosine distance or probabilistic linear discriminant analysis (PLDA) classifier. To this end, we build on an existing model that we refer to as Non-linear Within Class Normalization (NWCN) and introduce a novel Speaker Classifier Network (SCN). Both models substantially improve speaker verification performance: combined with a cosine distance backend, they achieve relative improvements of 56% and 68%, respectively, over standard i-vectors. The NWCN model also reduces the equal error rate (EER) for PLDA from 1.78% to 1.63%. We also test these models under domain mismatch, i.e. when no in-domain training data is available. Under these conditions, SCN features combined with cosine distance perform better than the PLDA baseline, achieving an EER of 2.92% compared to 3.37%.
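The cosine distance backend and the equal error rate metric mentioned above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`cosine_score`, `equal_error_rate`, `relative_improvement`) are hypothetical, the EER is approximated by a simple threshold sweep over the observed scores, and the vectors stand in for i-vectors or network-processed features.

```python
import numpy as np


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment vector and a test vector."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where miss and false-alarm rates meet.

    Assumption: thresholds are swept over the observed scores only, so the
    result is an approximation on small trial lists.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: fraction of target trials scored below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scored at or above it.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(miss - false_alarm)))
    return float((miss[idx] + false_alarm[idx]) / 2.0)


def relative_improvement(baseline_eer, new_eer):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline_eer - new_eer) / baseline_eer
```

Applied to the figures quoted in the abstract, `relative_improvement(1.78, 1.63)` gives roughly 0.084 (about an 8.4% relative reduction for PLDA with NWCN), and `relative_improvement(3.37, 2.92)` gives roughly 0.134 for SCN with cosine distance against the PLDA baseline under domain mismatch.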
