Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement

Deep neural network (DNN) embeddings for speaker recognition have recently attracted much attention. Compared to i-vectors, they are more robust to noise and room reverberation, as DNNs can leverage large-scale training. This article addresses the question of whether speech enhancement is still useful when DNN embeddings are used for speaker recognition. We investigate single- and multi-channel speech enhancement for text-independent speaker verification based on x-vectors in conditions where strong diffuse noise and reverberation are both present. Single-channel (monaural) enhancement is based on complex spectral mapping and is applied to each microphone independently. For multi-channel enhancement, we use a masking-based minimum variance distortionless response (MVDR) beamformer and its rank-1 approximation, and we propose a novel method for deriving time-frequency masks from the estimated complex spectrogram. In addition, we investigate gammatone frequency cepstral coefficients (GFCCs) as robust speaker features. Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly improve x-vector performance, and that the proposed covariance matrix estimate is effective for the MVDR beamformer.
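To make the multi-channel pipeline concrete, the following is a minimal NumPy sketch of a masking-based MVDR beamformer with rank-1 steering-vector estimation. It is an illustrative implementation of the standard technique, not the paper's exact system: the mask estimation itself (here passed in as arrays) and the paper's particular covariance estimate are assumed to come from the complex spectral mapping network described above, and the function name and diagonal-loading constant are choices made for this sketch.

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask, ref_mic=0):
    """Masking-based MVDR beamformer (illustrative sketch).

    Y           : multi-channel mixture STFT, shape (M, T, F)
    speech_mask : T-F mask for speech, shape (T, F), values in [0, 1]
    noise_mask  : T-F mask for noise,  shape (T, F)
    Returns the enhanced single-channel STFT, shape (T, F).
    """
    M, T, F = Y.shape
    out = np.zeros((T, F), dtype=complex)
    loading = 1e-6 * np.eye(M)  # diagonal loading for numerical stability
    for f in range(F):
        Yf = Y[:, :, f]  # (M, T)
        ws, wn = speech_mask[:, f], noise_mask[:, f]
        # Mask-weighted spatial covariance matrices
        Phi_s = (Yf * ws) @ Yf.conj().T / max(ws.sum(), 1e-8)
        Phi_n = (Yf * wn) @ Yf.conj().T / max(wn.sum(), 1e-8) + loading
        # Rank-1 approximation: steering vector = principal eigenvector
        _, eigvecs = np.linalg.eigh(Phi_s)
        d = eigvecs[:, -1]
        d = d / d[ref_mic]  # normalize to the reference microphone
        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)
        out[:, f] = w.conj() @ Yf  # distortionless: w^H d = 1
    return out
```

Because of the distortionless constraint, for a noise-free rank-1 mixture the beamformer output reproduces the reference channel exactly; in noisy conditions it steers a spatial null toward the mask-weighted noise covariance while preserving the speech direction.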
