Investigation on Neural Bandwidth Extension of Telephone Speech for Improved Speaker Recognition

We extend our previous work on training mixed-bandwidth (BW) speaker recognition systems by predicting the missing information in the upperband (UB) of upsampled telephone speech. Mixed-BW systems combine speech from narrowband (NB) and wideband (WB) corpora by simple upsampling of the NB speech with a low-pass interpolation filter, so that no information is lost from the original WB speech. In this work, we explore the use of a deep residual fully-convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) network, along with a previously proposed deep neural network (DNN), for bandwidth extension (BWE) of NB telephone speech. Speaker recognition systems trained with bandwidth-extended features outperformed the mixed-BW and NB baseline systems. In terms of detection cost function (DCF), the CNN-BWE system improved by 10.78% and 15.96% (relative) in the Speakers In The Wild (SITW) eval core and assist-multi-speaker conditions, respectively, w.r.t. the NB baseline; and by 3.21% and 4.13% w.r.t. the mixed-BW baseline.
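The mixed-BW training setup described above rests on a simple preprocessing step: upsampling 8 kHz narrowband speech to the 16 kHz wideband rate with a low-pass interpolation filter, which leaves the upperband (4–8 kHz) empty for a BWE model to fill in. A minimal sketch of that step, assuming `scipy` is available (the paper's actual filter design and feature pipeline are not specified here):

```python
import numpy as np
from scipy.signal import resample_poly

# Placeholder signal standing in for 1 s of 8 kHz narrowband telephone speech.
rng = np.random.default_rng(0)
nb = rng.standard_normal(8000)

# Polyphase upsampling by 2 with scipy's built-in Kaiser-windowed low-pass
# interpolation filter: output is at 16 kHz, but the upperband stays empty.
wb = resample_poly(nb, up=2, down=1)

# Verify the upperband carries almost no energy relative to the full spectrum.
spec = np.abs(np.fft.rfft(wb)) ** 2          # bins are 1 Hz apart here
ub_ratio = spec[5000:].sum() / spec.sum()    # energy above 5 kHz
print(len(wb), ub_ratio < 0.01)
```

A BWE front-end (DNN, CNN, or BLSTM, as compared in this work) would then predict the missing upperband content of `wb` rather than leaving it spectrally empty.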
