Orthogonal Training for Text-Independent Speaker Verification

In this paper we propose orthogonal training schemes to improve the effectiveness of cosine similarity measurements in text-independent speaker verification (SV) tasks. Compared to the PLDA backend, cosine similarity is simple to compute, and it does not require extra data or time to build a separate model. The use of cosine similarity measurement is also highly desirable for building end-to-end SV systems. However, the cosine similarity has an underlying assumption that the dimensions of the speaker embeddings are orthogonal, which is usually not satisfied in current SV systems. The training scheme applies singular vector decomposition (SVD) to the weight matrix of the speaker embedding extraction layer in our time delay neural network (TDNN)-based SV system, and replaces the original weight matrix by the matrix constructed from the left unitary matrix and the singular value matrix. Then the reconstructed matrix in the extraction layer is held constant and the remaining network is fine-tuned with an orthogonality regularizer. We further investigate orthogonal training from scratch, with orthogonality regularization incorporated throughout the network training. Experimental results show that our orthogonal training methods can significantly improve the system performance with a cosine similarity backend.

[1]  Basura Fernando,et al.  Generalized BackPropagation, Étude De Cas: Orthogonality , 2016, ArXiv.

[2]  Jiri Matas,et al.  All you need is a good init , 2015, ICLR.

[3]  Yiming Wang,et al.  Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[4]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Surya Ganguli,et al.  Exact solutions to the nonlinear dynamics of learning in deep linear neural networks , 2013, ICLR.

[6]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[7]  Georg Heigold,et al.  End-to-end text-dependent speaker verification , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Xiaohan Chen,et al.  Can We Gain More from Orthogonality Regularizations in Training Deep CNNs? , 2018, NeurIPS.

[9]  Yifan Gong,et al.  End-to-End attention based text-dependent speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[10]  Sanjeev Khudanpur,et al.  State-of-the-Art Speaker Recognition for Telephone and Video Speech: The JHU-MIT Submission for NIST SRE18 , 2019, INTERSPEECH.

[11]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[12]  Dacheng Tao,et al.  Improving Training of Deep Neural Networks via Singular Value Bounding , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Shuai Wang,et al.  Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition , 2019, 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[14]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Yifan Sun,et al.  SVDNet for Pedestrian Retrieval , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Aaron Lawson,et al.  The 2016 Speakers in the Wild Speaker Recognition Evaluation , 2016, INTERSPEECH.

[17]  Daniel Povey,et al.  Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification , 2018, INTERSPEECH.

[18]  Sergey Ioffe,et al.  Probabilistic Linear Discriminant Analysis , 2006, ECCV.

[19]  F. Xavier Roca,et al.  Regularizing CNNs with Locally Constrained Decorrelations , 2016, ICLR.

[20]  Minh N. Do,et al.  Special paraunitary matrices, Cayley transform, and multidimensional orthogonal filter banks , 2006, IEEE Transactions on Image Processing.

[21]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.