Speaker identification and clustering using convolutional neural networks

Deep learning, especially in the form of convolutional neural networks (CNNs), has triggered substantial improvements in computer vision and related fields in recent years. This progress is attributed to the shift from designing features and subsequent individual sub-systems towards learning features and recognition systems end to end from nearly unprocessed data. For speaker clustering, however, it is still common to use handcrafted processing chains such as MFCC features and GMM-based models. In this paper, we use simple spectrograms as input to a CNN and study the optimal design of those networks for speaker identification and clustering. Furthermore, we elaborate on the question how to transfer a network, trained for speaker identification, to speaker clustering. We demonstrate our approach on the well known TIMIT dataset, achieving results comparable with the state of the art-without the need for handcrafted features.

[1]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[2]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Andreas Stolcke,et al.  Artificial neural network features for speaker diarization , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[4]  Benjamin Schrauwen,et al.  End-to-end learning for music audio , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[6]  Bernd Freisleben,et al.  Unfolding speaker clustering potential: a biomimetic approach , 2009, ACM Multimedia.

[7]  Biing-Hwang Juang,et al.  The use of cohort normalized scores for speaker verification , 1992, ICSLP.

[8]  Ahmad Salman,et al.  Learning Speaker-Specific Characteristics With a Deep Neural Architecture , 2011, IEEE Transactions on Neural Networks.

[9]  M. Al-Akaidi Fractal Speech Processing , 2004 .

[10]  Anastasios Tefas,et al.  Multimodal speaker clustering in full length movies , 2015, Multimedia Tools and Applications.

[11]  Jeroen Breebaart,et al.  Features for audio and music classification , 2003, ISMIR.

[12]  Constantine Kotropoulos,et al.  Speaker segmentation and clustering , 2008, Signal Process..

[13]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[14]  Dong Yu,et al.  Exploring convolutional neural network structures and optimization techniques for speech recognition , 2013, INTERSPEECH.

[15]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[16]  Figen Ertaş,et al.  FUNDAMENTALS OF SPEAKER RECOGNITION , 2011 .

[17]  Yu Tsao,et al.  Clustering-based i-vector formulation for speaker recognition , 2014, INTERSPEECH.

[18]  Colin Raffel,et al.  Lasagne: First release. , 2015 .

[19]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[20]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[21]  Gerald Friedland,et al.  The ICSI RT-09 Speaker Diarization System , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Jitendra Ajmera,et al.  A robust speaker clustering algorithm , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[23]  Andrew C. Morris,et al.  PAPER Special Section/Issue on Corpus-Based Speech Technologies GMM based clustering and speaker separability in the Timit speech database , 2005 .

[24]  Colin Raffel,et al.  librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[25]  Yun Lei,et al.  Application of convolutional neural networks to speaker recognition in noisy conditions , 2014, INTERSPEECH.

[26]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[27]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[28]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[29]  Mickael Rouvier,et al.  Speaker diarization through speaker embeddings , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[30]  Simon King,et al.  Where are the challenges in speaker diarization? , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.