Learning Speaker Representations with Mutual Information

Learning good representations is of crucial importance in deep learning. Mutual Information (MI) and similar measures of statistical dependence are promising tools for learning such representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high-dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms the raw speech waveform into a compact feature vector. The discriminator is fed with either positive samples (drawn from the joint distribution of encoded chunks) or negative samples (drawn from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.
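
The following is a minimal PyTorch sketch of the encoder-discriminator setup described above. It is not the paper's implementation: a plain Conv1d stack stands in for SincNet, the chunk pairing, embedding size, layer widths, and the binary cross-entropy objective are illustrative assumptions, and the paper additionally compares other MI-based objective functions.

```python
# Sketch only: a generic Conv1d encoder replaces SincNet, and all sizes
# (chunk length, embedding dimension, layer widths) are illustrative.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps a raw waveform chunk to a compact speaker embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=10), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 64, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, wav):                  # wav: (batch, samples)
        h = self.net(wav.unsqueeze(1))       # (batch, 64, 1)
        return self.proj(h.squeeze(-1))      # (batch, emb_dim)


class Discriminator(nn.Module):
    """Scores whether two embeddings come from chunks of the same sentence."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=-1)).squeeze(-1)


def mi_step(encoder, discriminator, chunk_a, chunk_b, optimizer):
    """One training step.

    Positive pairs: (chunk_a[i], chunk_b[i]) come from the same sentence
    (samples of the joint distribution). Negative pairs are formed by
    shuffling chunk_b within the batch, approximating samples from the
    product of the marginals.
    """
    z_a, z_b = encoder(chunk_a), encoder(chunk_b)
    z_neg = z_b[torch.randperm(z_b.size(0), device=z_b.device)]  # break the pairing
    pos_logits = discriminator(z_a, z_b)
    neg_logits = discriminator(z_a, z_neg)
    bce = nn.functional.binary_cross_entropy_with_logits
    loss = bce(pos_logits, torch.ones_like(pos_logits)) + \
           bce(neg_logits, torch.zeros_like(neg_logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built over the parameters of both modules, e.g. torch.optim.Adam(list(encoder.parameters()) + list(discriminator.parameters())); after training, the encoder output serves as the speaker embedding used for identification or verification.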
