Multi-Task Adversarial Network Bottleneck Features for Noise-Robust Speaker Verification

Modern automatic speaker verification (ASV) systems need to be robust under various noisy conditions. Motivated by the success of generative adversarial networks (GANs), this paper proposes a multi-task adversarial network (MAN) for extracting noise-invariant bottleneck (BN) features. The MAN consists of three component networks, a feature encoding network (FEN), a speaker discriminative network (SDN) and a noise-domain adaptation network (NAN). The FEN aims to generate noise-robustness BN features, the SDN makes the features from the FEN more speaker-discriminative and the NAN guides the FEN to learn more noise-invariant feature representations. The MAN is trained using an adversarial method. When training FEN and SDN, speaker identities and the label of being clean speech are used as target labels, which can make BN features, extracted from noisy or clean speech, similar. When training NAN, on the contrary, noise types are used as training targets. We evaluate the newly proposed MAN-BN feature extraction method on a Gaussian mixture model-universal background model (GMM-UBM) based ASV system. The experimental results on the RSR2015 database show that the proposed MAN-BN feature can dramatically improve the accuracy of the ASV system under different noise-type and signal-to-noise-ratio conditions.

[1]  Tomi Kinnunen,et al.  Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[2]  Hynek Hermansky,et al.  Developing a speaker identification system for the DARPA RATS project , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Yun Lei,et al.  Towards noise-robust speaker recognition using probabilistic linear discriminant analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Zheng-Hua Tan,et al.  Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[6]  Bhuvana Ramabhadran,et al.  Invariant Representations for Noisy Speech Recognition , 2016, ArXiv.

[7]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[8]  Aidong Men,et al.  An Improved Ranking-Based Feature Enhancement Approach for Robust Speaker Recognition , 2016, IEEE Access.

[9]  Hagai Aronowitz,et al.  Audio enhancing with DNN autoencoder for speaker recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[11]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[12]  Yusuke Shinohara,et al.  Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition , 2016, INTERSPEECH.

[13]  Jesper Jensen,et al.  Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Alan Ritter,et al.  Adversarial Learning for Neural Dialogue Generation , 2017, EMNLP.

[15]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[17]  Yun Lei,et al.  A noise robust i-vector extractor using vector taylor series for speaker recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[19]  Jesper Jensen,et al.  Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[20]  Bin Ma,et al.  Text-dependent speaker verification: Classifiers, databases and RSR2015 , 2014, Speech Commun..

[21]  Guo-Jun Qi,et al.  Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities , 2017, International Journal of Computer Vision.

[22]  Jun Guo,et al.  Effect of multi-condition training and speech enhancement methods on spoofing detection , 2016, 2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE).

[23]  Jesper Jensen,et al.  MMSE based noise PSD tracking with low complexity , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[26]  Søren Holdt Jensen,et al.  Speaker-Dependent Dictionary-Based Speech Enhancement for Text-Dependent Speaker Verification , 2016, INTERSPEECH.

[27]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[28]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[29]  D. Wang,et al.  Computational Auditory Scene Analysis: Principles, Algorithms, and Applications , 2008, IEEE Trans. Neural Networks.

[30]  Yann LeCun,et al.  Energy-based Generative Adversarial Network , 2016, ICLR.

[31]  Olof Mogren,et al.  C-RNN-GAN: Continuous recurrent neural networks with adversarial training , 2016, ArXiv.

[32]  Ya Zhang,et al.  Deep feature for text-dependent speaker verification , 2015, Speech Commun..

[33]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .