论文信息 - Centroid-based Deep Metric Learning for Speaker Recognition

Centroid-based Deep Metric Learning for Speaker Recognition

Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between recognizing speakers in the training set and unseen speakers. The latter case corresponds to the few-shot learning task, where a trained model is evaluated on unseen classes. Here, we optimize a speaker embedding model with prototypical network loss (PNL), a state-of-the-art approach for the few-shot image classification task. The resulting embedding model outperforms the state-of-the-art triplet loss based models in both speaker verification and identification tasks, for both seen and unseen speakers.

[1] Junichi Yamagishi,et al. SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit , 2016 .

[2] Quan Wang,et al. Generalized End-to-End Loss for Speaker Verification , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Yann LeCun,et al. Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[4] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Inderjit S. Dhillon,et al. Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[6] Pietro Perona,et al. One-shot learning of object categories , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] Patrick Kenny,et al. Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[8] Richard S. Zemel,et al. Prototypical Networks for Few-shot Learning , 2017, NIPS.

[9] Chunlei Zhang,et al. End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances , 2017, INTERSPEECH.

[10] Joshua B. Tenenbaum,et al. Human-level concept learning through probabilistic program induction , 2015, Science.

[11] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12] Matthieu Cord,et al. Learning a Distance Metric from Relative Comparisons between Quadruplets of Images , 2016, International Journal of Computer Vision.

[13] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .

[14] Raquel Urtasun,et al. Dimensionality Reduction for Representing the Knowledge of Probabilistic Models , 2018, International Conference on Learning Representations.

[15] Sanjeev Khudanpur,et al. Deep Neural Network Embeddings for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[16] Yann LeCun,et al. Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[17] Xiao Liu,et al. Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[18] Thomas Fillon,et al. YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software , 2010, ISMIR.

[19] Sanjeev Khudanpur,et al. X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Joon Son Chung,et al. VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[21] Thorsten Joachims,et al. Learning a Distance Metric from Relative Comparisons , 2003, NIPS.

[22] Driss Matrouf,et al. Study of the Effect of I-vector Modeling on Short and Mismatch Utterance Duration for Speaker Verification , 2012, INTERSPEECH.

[23] Hervé Bredin,et al. TristouNet: Triplet loss for speaker turn embedding , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24] Stefanie Jegelka,et al. Deep Metric Learning via Facility Location , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).