Investigation on Neural Bandwidth Extension of Telephone Speech for Improved Speaker Recognition

We extend our previous work on training mixed-bandwidth (BW) speaker recognition systems by predicting the missing information in the upperband (UB) of upsampled telephone speech. Mixed-BW systems combine speech from narrowband (NB) and wideband (WB) corpora by simple upsampling of the NB speech with a low-pass interpolation filter, so that no information is lost from the original WB speech. In this work, we explore the use of a deep residual fully-convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) network, along with a previously proposed deep neural network (DNN), for bandwidth extension (BWE) of NB telephone speech. Speaker recognition systems trained with bandwidth-extended features outperformed the mixed-BW and NB baseline systems. In terms of detection cost function (DCF), the CNN-BWE system improved by 10.78% and 15.96% (relative) in the Speakers In The Wild (SITW) eval core and assist-multi-speaker conditions, respectively, w.r.t. the NB baseline; and by 3.21% and 4.13% w.r.t. the mixed-BW baseline.
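The mixed-BW training setup described above rests on a simple preprocessing step: upsampling 8 kHz narrowband speech to the 16 kHz wideband rate with a low-pass interpolation filter, which leaves the upperband (4–8 kHz) empty for a BWE model to fill in. A minimal sketch of that step, assuming `scipy` is available (the paper's actual filter design and feature pipeline are not specified here):

```python
import numpy as np
from scipy.signal import resample_poly

# Placeholder signal standing in for 1 s of 8 kHz narrowband telephone speech.
rng = np.random.default_rng(0)
nb = rng.standard_normal(8000)

# Polyphase upsampling by 2 with scipy's built-in Kaiser-windowed low-pass
# interpolation filter: output is at 16 kHz, but the upperband stays empty.
wb = resample_poly(nb, up=2, down=1)

# Verify the upperband carries almost no energy relative to the full spectrum.
spec = np.abs(np.fft.rfft(wb)) ** 2          # bins are 1 Hz apart here
ub_ratio = spec[5000:].sum() / spec.sum()    # energy above 5 kHz
print(len(wb), ub_ratio < 0.01)
```

A BWE front-end (DNN, CNN, or BLSTM, as compared in this work) would then predict the missing upperband content of `wb` rather than leaving it spectrally empty.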
