P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

The Time-Delay Neural Network (TDNN) and the Transformer are both widely used as backbones for Speaker Verification (SV), and each has complementary strengths and weaknesses from the perspective of local and global feature modeling. How to effectively integrate these two styles of features remains an open issue. In this paper, we explore a Parallel-coupled TDNN/Transformer Network (p-vectors) as an alternative to serial hybrid networks. p-vectors allows the TDNN and Transformer branches to learn complementary information from each other through Soft Feature Alignment Interaction (SFAI) while preserving both local and global features. p-vectors also uses Spatial Frequency-channel Attention (SFA) to enhance spatial interdependence modeling of the input features. Finally, the outputs of the two branches are combined by an Embedding Aggregation Layer (EAL). Experiments show that p-vectors outperforms MACCIF-TDNN and MFA-Conformer, with relative EER improvements of 11.5% and 13.9%, respectively, on VoxCeleb1-O.
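To make the parallel-coupled design concrete, the following is a minimal PyTorch sketch of the overall data flow: a TDNN branch and a Transformer branch run side by side, exchange features after each block (standing in for SFAI), and are pooled and fused into a single embedding (standing in for the EAL). The class name, layer sizes, and the simple projection-and-add exchange are illustrative assumptions, not the paper's exact modules; in particular, the SFA attention on the input features is omitted for brevity.

```python
import torch
import torch.nn as nn

class PVectorsSketch(nn.Module):
    """Minimal sketch of a parallel-coupled TDNN/Transformer backbone.

    The per-block exchange below (linear projection + addition between
    branches) is an assumed stand-in for SFAI; the paper's actual
    alignment mechanism may differ.
    """

    def __init__(self, feat_dim=80, channels=512, n_heads=8, n_blocks=3,
                 embed_dim=192):
        super().__init__()
        # Branch stems: 1-D convolution for the local (TDNN) branch,
        # linear projection for the global (Transformer) branch.
        self.tdnn_in = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        self.trans_in = nn.Linear(feat_dim, channels)
        self.tdnn_blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.BatchNorm1d(channels),
            )
            for _ in range(n_blocks)
        )
        self.trans_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(channels, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        # SFAI stand-in: projections that pass each branch's features
        # to the other after every block.
        self.to_global = nn.ModuleList(nn.Linear(channels, channels)
                                       for _ in range(n_blocks))
        self.to_local = nn.ModuleList(nn.Linear(channels, channels)
                                      for _ in range(n_blocks))
        # EAL stand-in: mean-pool both branches, concatenate, project.
        self.embed = nn.Linear(2 * channels, embed_dim)

    def forward(self, x):                          # x: (batch, time, feat_dim)
        local = self.tdnn_in(x.transpose(1, 2))    # (B, C, T)
        glob = self.trans_in(x)                    # (B, T, C)
        for tdnn, trans, to_g, to_l in zip(self.tdnn_blocks, self.trans_blocks,
                                           self.to_global, self.to_local):
            local = tdnn(local)
            glob = trans(glob)
            # Soft exchange of complementary information between branches.
            local = local + to_l(glob).transpose(1, 2)
            glob = glob + to_g(local.transpose(1, 2))
        pooled = torch.cat([local.mean(dim=2), glob.mean(dim=1)], dim=1)
        return self.embed(pooled)                  # speaker embedding

# Usage: embeddings = PVectorsSketch()(torch.randn(4, 200, 80))  # (4, 192)
```

The key design point the sketch illustrates is that neither branch is downstream of the other, unlike serial hybrids such as the Conformer: the TDNN keeps its local receptive field and the Transformer keeps full self-attention, and only the exchanged residuals carry information across.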

[1] Joon Son Chung et al., VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge, arXiv, 2023.

[2] Chengming Liu et al., Global–Local Self-Attention Based Transformer for Speaker Verification, Applied Sciences, 2022.

[3] A. Etemad et al., Fine-grained Early Frequency Attention for Deep Speaker Recognition, IJCNN, 2022.

[4] Y. Qian et al., Local Information Modeling with Self-Attention for Speaker Verification, ICASSP, 2022.

[5] Lantian Li et al., Reliable Visualization for Deep Speaker Recognition, INTERSPEECH, 2022.

[6] Haibin Wu et al., MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification, INTERSPEECH, 2022.

[7] Rohan Kumar Das et al., MFA: TDNN with Multi-Scale Frequency-Channel Attention for Text-Independent Speaker Verification with Short Utterances, ICASSP, 2022.

[8] Fangyuan Wang et al., MACCIF-TDNN: Multi Aspect Aggregation of Channel and Context Interdependence Features in TDNN-Based Speaker Verification, ASRU, 2021.

[9] Qingyang Hong et al., Additive Phoneme-aware Margin Softmax Loss for Language Recognition, INTERSPEECH, 2021.

[10] Ming-Ming Cheng et al., LayerCAM: Exploring Hierarchical Class Activation Maps for Localization, IEEE Transactions on Image Processing, 2021.

[11] Yaowei Wang et al., Conformer: Local Features Coupling Global Representations for Visual Recognition, ICCV, 2021.

[12] Kris Demuynck et al., Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification, INTERSPEECH, 2021.

[13] Shinji Watanabe et al., Recent Developments on ESPnet Toolkit Boosted by Conformer, ICASSP, 2021.

[14] Haizhou Li et al., Speaker-Utterance Dual Attention for Speaker and Utterance Verification, INTERSPEECH, 2020.

[15] S. Umesh et al., S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification, arXiv, 2020.

[16] Pooyan Safari et al., Self-Attention Encoding and Pooling for Speaker Recognition, INTERSPEECH, 2020.

[17] Yu Zhang et al., Conformer: Convolution-augmented Transformer for Speech Recognition, INTERSPEECH, 2020.

[18] Kris Demuynck et al., ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification, INTERSPEECH, 2020.

[19] Ian McLoughlin et al., An Effective Deep Embedding Learning Architecture for Speaker Verification, INTERSPEECH, 2019.

[20] Kai Zhao et al., Res2Net: A New Multi-Scale Backbone Architecture, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[21] Joon Son Chung et al., VoxCeleb2: Deep Speaker Recognition, INTERSPEECH, 2018.

[22] Gang Sun et al., Squeeze-and-Excitation Networks, CVPR, 2018.

[23] Lukáš Burget et al., Analysis of Score Normalization in Multilingual Speaker Recognition, INTERSPEECH, 2017.

[24] Sanjeev Khudanpur et al., Deep Neural Network Embeddings for Text-Independent Speaker Verification, INTERSPEECH, 2017.

[25] Joon Son Chung et al., VoxCeleb: A Large-Scale Speaker Identification Dataset, INTERSPEECH, 2017.

[26] Lukasz Kaiser et al., Attention Is All You Need, NIPS, 2017.

[27] Sanjeev Khudanpur et al., A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition, ICASSP, 2017.

[28] Daniel Povey et al., MUSAN: A Music, Speech, and Noise Corpus, arXiv, 2015.

[29] Leslie N. Smith et al., Cyclical Learning Rates for Training Neural Networks, WACV, 2017.

[30] Patrick Kenny et al., Front-End Factor Analysis for Speaker Verification, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[31] Geoffrey E. Hinton et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.