Investigation on Bandwidth Extension for Speaker Recognition

In this work, we investigate training speaker recognition systems on wideband (WB) features and compare their performance with narrowband (NB) baselines. NIST speaker recognition evaluations have largely driven speaker recognition research in recent years. Because of the target application of these evaluations, most of the data available to train speaker recognition systems is NB telephone speech, while WB data have remained too scarce to train factor analysis and PLDA models. Thus, the usual practice when dealing with WB speech is to downsample the signal to 8 kHz, which implies a potential loss of useful information. Instead, we experiment with upsampling the telephone training data and leaving the WB data unchanged. We adopt two techniques to upsample the telephone data: (1) a feed-forward neural network, termed the Bandwidth Extension (BWE) network, which predicts WB features given NB features as input; and (2) basic upsampling with a low-pass filter interpolator. While the former attempts to estimate the missing high-frequency information, the latter does not. The upsampled features are used to train state-of-the-art i-vector models and the recently proposed x-vector models. We evaluate the systems on the Speakers in the Wild (SITW) database, obtaining an 11.5% relative improvement in detection cost function (DCF) with the x-vector model.
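Since the exact BWE architecture, feature configuration, and training recipe are not detailed in this abstract, the following is only a minimal sketch of the idea: a feed-forward regressor that maps a context window of NB feature frames to the corresponding WB frame, trained with a mean-squared-error loss on parallel NB/WB pairs (for example, WB speech downsampled to 8 kHz to produce the NB side). The layer sizes, feature dimensions, and the `BWENet` name are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of a feed-forward bandwidth-extension (BWE) network.
# Architecture and dimensions are assumptions; the paper's exact recipe is
# not specified in this abstract.
import torch
import torch.nn as nn

class BWENet(nn.Module):
    """Maps narrowband feature frames (with context) to a wideband frame."""

    def __init__(self, nb_dim=40, context=5, wb_dim=40, hidden=1024):
        super().__init__()
        in_dim = nb_dim * (2 * context + 1)  # stacked context window
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, wb_dim),        # regression to WB features
        )

    def forward(self, x):
        return self.net(x)

# One training step: MSE between predicted and true WB frames, computed on
# parallel NB/WB feature pairs (dummy tensors stand in for real features).
model = BWENet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

nb_batch = torch.randn(32, 40 * 11)   # NB frames with +/-5 frames of context
wb_batch = torch.randn(32, 40)        # WB target frames
loss = criterion(model(nb_batch), wb_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```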

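For comparison, a sketch of the plain upsampling path, assuming the common SciPy polyphase resampler (`scipy.signal.resample_poly`): it interpolates 8 kHz telephone audio to 16 kHz with an anti-imaging low-pass filter and, unlike the BWE network, recreates no content above 4 kHz.

```python
# Sketch of basic low-pass-filter interpolation: 8 kHz telephone audio is
# upsampled to 16 kHz, but the band above 4 kHz stays (close to) empty.
import numpy as np
from scipy.signal import resample_poly

def upsample_nb_to_wb(samples_8k: np.ndarray) -> np.ndarray:
    """Upsample a 1-D 8 kHz signal to 16 kHz (factor 2) using a polyphase
    low-pass interpolation filter."""
    return resample_poly(samples_8k, up=2, down=1)

# Example: one second of dummy telephone-band audio.
x_8k = np.random.randn(8000).astype(np.float32)
x_16k = upsample_nb_to_wb(x_8k)
assert x_16k.shape[0] == 16000
```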