Paper Information - ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

Recently, extracting speaker embeddings directly from the raw waveform has drawn increasing attention in the field of speaker verification. Parametric real-valued filters in the first convolutional layer are learned to transform the waveform into time-frequency representations. However, these methods focus only on the magnitude spectrum, and the poor interpretability of the learned filters limits performance. In this paper, we propose a complex speaker embedding extractor, named ICSpk, with higher interpretability and fewer parameters. Specifically, to quantify the speaker-related frequency response of the waveform, we first modify the original short-term Fourier transform filters into a family of complex exponential filters, named interpretable complex (IC) filters. Each IC filter is constrained to the form of a complex exponential filter parameterized by frequency. Then, a deep complex-valued speaker embedding extractor is designed to operate on the complex-valued output of the IC filters. The proposed ICSpk is evaluated on the VoxCeleb and CN-Celeb databases. Experimental results demonstrate that the IC filter-based system exhibits a significant improvement over complex-spectrogram-based systems. Furthermore, the proposed ICSpk outperforms existing raw-waveform-based systems by a large margin.
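To make the idea of a complex exponential filter parameterized by frequency more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact filter form, so the Hann window, the filter length, the hop size, and the helper names `complex_exponential_filter` and `apply_filter` are all assumptions made for illustration. The frequencies are fixed here, whereas in a learnable front end they would presumably be trainable parameters.

```python
import cmath
import math


def complex_exponential_filter(freq_hz, sample_rate, length):
    """Hann-windowed complex exponential at one frequency (hypothetical helper)."""
    taps = []
    for n in range(length):
        # Periodic Hann window; the window choice is an assumption, not from the paper.
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
        taps.append(window * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate))
    return taps


def apply_filter(signal, taps, hop):
    """Slide the complex taps over a real signal; returns one complex value per frame."""
    length = len(taps)
    return [
        sum(signal[start + n] * taps[n] for n in range(length))
        for start in range(0, len(signal) - length + 1, hop)
    ]


sample_rate = 16000          # Hz (assumed)
length, hop = 400, 160       # 25 ms filters, 10 ms hop (assumed)

# One scalar frequency fully determines each filter in this sketch.
freqs_hz = [200.0, 1000.0, 3000.0]
bank = [complex_exponential_filter(f, sample_rate, length) for f in freqs_hz]

# Test input: 0.1 s of a pure 1 kHz tone.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(1600)]

# Complex-valued outputs: magnitude and phase are both retained.
outputs = [apply_filter(tone, taps, hop) for taps in bank]
mean_magnitude = [sum(abs(v) for v in frames) / len(frames) for frames in outputs]
strongest = max(range(len(freqs_hz)), key=lambda k: mean_magnitude[k])
print(freqs_hz[strongest])
```

In this sketch each filter is described by a single interpretable number, its frequency, and the output stays complex-valued, so a downstream complex-valued network could use phase as well as magnitude. The filter tuned to 1 kHz responds most strongly to the 1 kHz test tone.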
