Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio

Practical applications often require speaker recognition systems to work well on audio with different sampling rates. However, the performance of these systems degrades substantially when the sampling rate differs between training and test conditions. For example, wideband speaker recognition models trained on 16 kHz audio perform poorly on 8 kHz telephony audio because the frequency content above 4 kHz is missing. In this paper, we propose a Deep Neural Network (DNN) based system that estimates the speech spectrum above 4 kHz for narrowband 8 kHz telephony audio. We train the proposed system on speech datasets processed with various simulated telephony codecs. We then run speaker recognition experiments in which the bandwidth expansion system serves as a preprocessor for speaker verification with wideband models. The evaluation datasets are codec-degraded, downsampled VoxCeleb1 and SITW, and the NIST SRE 2010 10s-10s condition. The proposed system significantly improves verification results compared with simple upsampling by interpolation and low-pass filtering. These experiments also indicate that the proposed bandwidth expansion system can be used successfully as a data augmentation method for training speaker embedding systems.
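
To make the baseline concrete, the following is a minimal sketch of upsampling by interpolation and low-pass filtering, written with NumPy only. It is an illustration, not the authors' implementation: the function names, the 101-tap Hamming-windowed FIR filter, and the two-tone test signal (1 kHz and 6 kHz) are all assumptions chosen for the example. It shows that the upsampled signal has a 16 kHz sampling rate but still carries almost no energy above 4 kHz, which is the gap a bandwidth expansion system tries to fill.

```python
import numpy as np


def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass FIR.

    cutoff is a fraction of the Nyquist frequency (0 < cutoff < 1).
    The taps are normalized to unity gain at DC.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()


def upsample_2x(x, num_taps=101):
    """Baseline 8 kHz -> 16 kHz conversion (hypothetical helper).

    Inserts a zero between samples, then low-pass filters at the old
    Nyquist frequency (4 kHz) to remove the spectral image.
    """
    stuffed = np.zeros(2 * len(x))
    stuffed[::2] = x
    h = lowpass_fir(num_taps, cutoff=0.5)
    # Gain of 2 compensates for the amplitude lost to zero insertion.
    return 2.0 * np.convolve(stuffed, h, mode="same")


def highband_energy_ratio(x, sample_rate, split_hz=4000.0):
    """Fraction of the signal energy located above split_hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return power[freqs > split_hz].sum() / power.sum()


# Assumed test signal: one second of a 1 kHz tone plus a 6 kHz tone at 16 kHz.
t = np.arange(16000) / 16000.0
wideband = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

# Simulate narrowband audio: low-pass at 4 kHz, then keep every second sample.
narrowband = np.convolve(wideband, lowpass_fir(101, 0.5), mode="same")[::2]

# Baseline reconstruction back to a 16 kHz sampling rate.
upsampled = upsample_2x(narrowband)

wide_ratio = highband_energy_ratio(wideband, 16000)
up_ratio = highband_energy_ratio(upsampled, 16000)
print(f"high-band energy ratio, wideband:  {wide_ratio:.4f}")
print(f"high-band energy ratio, upsampled: {up_ratio:.4f}")
```

In this sketch the 6 kHz tone is lost when the signal is reduced to 8 kHz and does not return after upsampling, so a wideband model would see an empty upper band. Real telephony audio is further degraded by the codec, which this example does not simulate.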
