Multi-User VoiceFilter-Lite via Attentive Speaker Embedding

In this paper, we propose a solution that allows speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved via an attention mechanism over multiple speaker embeddings that computes a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it on three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR must detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite significantly reduces speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models, such as personal VAD and personalized ASR.
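
To make the pooling step concrete, below is a minimal NumPy sketch of attention over enrolled speaker embeddings, not the paper's actual implementation: the current acoustic frame serves as the attention query, and each enrolled user's embedding serves as both key and value. The projection matrices `w_query` and `w_key`, the scaled dot-product scoring, and all dimensions are illustrative assumptions.

```python
import numpy as np

def attentive_speaker_embedding(frame_features, speaker_embeddings,
                                w_query, w_key):
    """Pool N enrolled-speaker embeddings into one attentive embedding.

    frame_features:     (feat_dim,)        features of the current frame
    speaker_embeddings: (n, emb_dim)       embeddings of the enrolled users
    w_query:            (feat_dim, att_dim) learned projection (assumed)
    w_key:              (emb_dim, att_dim)  learned projection (assumed)
    """
    query = frame_features @ w_query                 # (att_dim,)
    keys = speaker_embeddings @ w_key                # (n, att_dim)
    scores = keys @ query / np.sqrt(keys.shape[-1])  # (n,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over users
    # Convex combination of the enrolled embeddings: a single side input,
    # regardless of how many users are enrolled.
    return weights @ speaker_embeddings              # (emb_dim,)
```

Because the pooled output has the same shape as a single speaker embedding, the downstream model consumes it unchanged no matter how many users are enrolled, which is what makes this approach portable to other speaker-conditioned models.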
