FDN: Finite Difference Network with Hierarchical Convolutional Features for Text-independent Speaker Verification

In recent years, raw waveforms have been widely explored as direct input to deep networks for speaker verification. For example, RawNet and RawNet2 extract speaker embeddings automatically from waveforms, which greatly reduces front-end computation while achieving state-of-the-art performance. However, these models do not consider a speaker's high-level behavioral features, such as intonation, which reflects each speaker's universal style, rhythm, etc. This paper presents a novel network that captures intonation information by computing the finite difference of different speakers' utterance variations. Furthermore, a hierarchical design enhances the intonation features from coarse to fine to improve system accuracy. The high-level intonation features are then fused with the low-level embedding features. Experimental results on the official VoxCeleb1 test data and on the VoxCeleb1-E and VoxCeleb-H protocols show that our method outperforms existing state-of-the-art systems and is more robust. To facilitate further research, code is available at https://github.com/happyjin/FDN
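The finite-difference idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of step sizes, the mean-absolute pooling, and the concatenation-based fusion are all assumptions made for illustration, and the actual FDN architecture may differ in every one of these respects. The sketch takes a frame-level feature sequence, computes forward differences along time at several step sizes (large steps for coarse variation, small steps for fine variation), pools each difference over time, and concatenates the result with a low-level embedding.

```python
import numpy as np


def finite_difference(features: np.ndarray, step: int = 1) -> np.ndarray:
    """Forward difference along the time axis: d[t] = x[t + step] - x[t].

    `features` has shape (time, channels). Hypothetical helper, not from the paper.
    """
    if step < 1 or step >= features.shape[0]:
        raise ValueError("step must satisfy 1 <= step < number of frames")
    return features[step:] - features[:-step]


def hierarchical_differences(features: np.ndarray, steps=(4, 2, 1)) -> list:
    """Differences from coarse (large step) to fine (small step).

    The step sizes are an illustrative assumption.
    """
    return [finite_difference(features, s) for s in steps]


def fuse(embedding: np.ndarray, diffs: list) -> np.ndarray:
    """Pool each difference map over time and concatenate with the embedding.

    Mean absolute value is used for pooling because a plain mean of
    differences telescopes to (last - first) / n and discards most variation.
    Concatenation is an assumed fusion strategy.
    """
    pooled = [np.abs(d).mean(axis=0) for d in diffs]
    return np.concatenate([embedding, *pooled])


if __name__ == "__main__":
    # Toy input: 6 frames, 2 channels, values increasing linearly in time.
    frames = np.arange(12, dtype=float).reshape(6, 2)
    low_level = np.zeros(3)  # stand-in for a learned speaker embedding
    fused = fuse(low_level, hierarchical_differences(frames))
    print(fused.shape)
```

In a trained system the differences would presumably be taken over learned convolutional feature maps rather than a toy array, and the fusion would be learned rather than a fixed concatenation.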
