Learning embeddings for speaker clustering based on voice equality

Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification can extract features from spectrograms that are useful for speaker clustering. These features are the activations of a particular hidden layer and are called embeddings. However, previous approaches require large amounts of additional speaker data to learn the embedding. Although the resulting clustering performance is on par with more traditional approaches based on, e.g., MFCC features, there is room for improvement because the embeddings are trained on a surrogate task, identifying a few specific speakers, that is rather far removed from the actual goal of separating unknown voices. We address both problems by training a CNN on weakly labeled data to extract embeddings that are similar for the same speaker, regardless of the speaker's specific identity. We demonstrate our approach on the well-known TIMIT dataset, which has often been used for speaker clustering experiments. We exceed the clustering performance of all previous approaches while requiring just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.
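The idea of learning embeddings that are similar for the same speaker can be illustrated with a pairwise objective over weak "same speaker / different speaker" labels. The abstract does not state which loss is used, so the following is only a minimal sketch that assumes a standard margin-based contrastive loss on Euclidean distance; the function names, the `margin` parameter, and the toy vectors are hypothetical and stand in for CNN outputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Contrastive loss for one pair of embeddings (assumed, not from the paper).

    Pairs from the same speaker are pulled together (squared distance).
    Pairs from different speakers are pushed apart until they are at
    least `margin` away, after which they contribute no loss.
    """
    d = euclidean(emb_a, emb_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings standing in for the activations of a hidden layer.
utterance_1 = [0.0, 0.0]
utterance_2 = [3.0, 4.0]  # distance 5 from utterance_1

# Only the weak label (same speaker or not) is needed, not the identity.
loss_if_same = pairwise_loss(utterance_1, utterance_2, same_speaker=True)
loss_if_different = pairwise_loss(utterance_1, utterance_2, same_speaker=False)
```

In this sketch the pair is penalized heavily if it is labeled as coming from the same speaker, and not at all if it is labeled as coming from different speakers, because the two vectors are already farther apart than the margin.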
