LSTM Based End-to-End Text-Independent Speaker Verification Using Raw Waveform

Speakers can be discriminated at either the voice source level or the vocal tract system level. Conventionally, Mel-frequency cepstral coefficients (MFCCs) or Mel filterbank energies are used as the input acoustic features in neural-network-based speaker verification (SV) systems. In this paper, we investigate LSTM-based speaker verification using the raw waveform as the input feature. A basic LSTM-based SV model and a variant with an attention layer are trained and optimized on two datasets, with raw waveform features and with Fbank (Mel filterbank) features. Experimental results show that, compared with the models trained on Fbank features, the models trained on raw waveforms achieve promising performance, which suggests that the raw waveform is a competitive acoustic feature for LSTM-based speaker verification.
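To make the idea of feeding a raw waveform to an LSTM concrete, the following is a minimal sketch of one common way to turn a 1-D signal into a sequence of time steps: slicing it into overlapping frames. The abstract does not state how the waveform is segmented, so the function names `frame_waveform` and `normalize_frame`, the frame length of 400 samples, and the hop of 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate) are illustrative assumptions, not the authors' configuration.

```python
from typing import List, Sequence


def frame_waveform(samples: Sequence[float],
                   frame_len: int = 400,   # assumed: 25 ms at 16 kHz
                   hop: int = 160          # assumed: 10 ms at 16 kHz
                   ) -> List[List[float]]:
    """Split a 1-D waveform into overlapping frames.

    Each returned frame would serve as the input vector for one
    LSTM time step. Trailing samples that do not fill a whole
    frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames


def normalize_frame(frame: Sequence[float]) -> List[float]:
    """Remove the mean of a frame and scale it to unit peak amplitude.

    This is a hypothetical preprocessing step; the source does not
    say whether or how raw frames are normalized.
    """
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]
    peak = max(abs(x) for x in centered)
    if peak == 0:
        return centered  # silent frame: nothing to scale
    return [x / peak for x in centered]


if __name__ == "__main__":
    # Toy signal of 10 samples, framed with length 4 and hop 2.
    toy = [float(i) for i in range(10)]
    frames = frame_waveform(toy, frame_len=4, hop=2)
    print(len(frames), frames[-1])
```

Under these assumptions, the resulting list of frames has shape (time steps, frame length) and plays the role that a sequence of Fbank vectors plays in the conventional setup.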
