ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

Recently, extracting speaker embeddings directly from the raw waveform has drawn increasing attention in the field of speaker verification. In such systems, parametric real-valued filters in the first convolutional layer are learned to transform the waveform into a time-frequency representation. However, these methods focus only on the magnitude spectrum, and the poor interpretability of the learned filters limits performance. In this paper, we propose a complex speaker embedding extractor, named ICSpk, with higher interpretability and fewer parameters. Specifically, to quantify the speaker-related frequency response of the waveform, we first modify the original short-term Fourier transform filters into a family of complex exponential filters, named interpretable complex (IC) filters. Each IC filter is confined to a complex exponential parameterized by its frequency. Then, a deep complex-valued speaker embedding extractor is designed to operate on the complex-valued output of the IC filters. The proposed ICSpk is evaluated on the VoxCeleb and CN-Celeb databases. Experimental results demonstrate that the IC filter-based system exhibits a significant improvement over complex spectrogram-based systems. Furthermore, the proposed ICSpk outperforms existing raw waveform-based systems by a large margin.
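The core idea of an IC filter, a windowed complex exponential whose only free parameter is its center frequency, can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the window choice, filter length, and the names `ic_filterbank` and `analyze` are assumptions, and in the actual system the frequencies would be learnable parameters trained jointly with the complex-valued network.

```python
import numpy as np

def ic_filterbank(freqs_hz, filter_len, fs):
    """Build a bank of windowed complex exponential (IC-style) filters.

    Each filter h_k[n] = w[n] * exp(-j*2*pi*f_k*n/fs), so the center
    frequency f_k is the only parameter per filter (window is assumed).
    Returns a complex array of shape (num_filters, filter_len).
    """
    n = np.arange(filter_len)
    window = np.hamming(filter_len)  # assumed window choice
    return window * np.exp(-2j * np.pi * np.outer(freqs_hz, n) / fs)

def analyze(waveform, filters, hop):
    """Frame the waveform and correlate each frame with every filter,
    yielding a complex time-frequency representation
    of shape (num_frames, num_filters)."""
    flen = filters.shape[1]
    starts = range(0, len(waveform) - flen + 1, hop)
    frames = np.stack([waveform[s:s + flen] for s in starts])
    return frames @ filters.T

# Toy usage: a 1 kHz tone responds most strongly to the 1 kHz filter.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
bank = ic_filterbank(np.array([500.0, 1000.0, 2000.0]), filter_len=400, fs=fs)
spec = analyze(tone, bank, hop=160)          # complex-valued output
energy = np.abs(spec).mean(axis=0)           # peaks at the 1000 Hz filter
```

Because the output is complex, both magnitude and phase are preserved for the downstream complex-valued embedding extractor, which is the distinction the abstract draws against magnitude-only learnable front ends.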
