Speaker activity driven neural speech extraction

Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25%.
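To make the idea concrete, the following is a minimal PyTorch sketch of an activity-conditioned extraction network: a mask-based network that takes a single-channel mixture spectrogram together with a frame-level target-speaker activity sequence (e.g., produced by a diarization system) and estimates the target speaker's magnitude spectrogram. The fusion scheme, layer sizes, and names are illustrative assumptions, not the paper's actual ADEnet architecture.

```python
import torch
import torch.nn as nn

class ADEnetSketch(nn.Module):
    """Illustrative speaker-activity-driven extraction network (sketch).

    Hypothetical reconstruction of the abstract's idea: condition a
    mask-based extraction network on a frame-level target-speaker
    activity sequence instead of an enrollment utterance. All design
    choices here (additive fusion, BLSTM backbone) are assumptions.
    """

    def __init__(self, n_freq: int = 257, hidden: int = 256):
        super().__init__()
        # Frame-wise encoding of the mixture magnitude spectrogram.
        self.mix_proj = nn.Linear(n_freq, hidden)
        # Embed the per-frame activity value (0/1 from diarization) so
        # it can be fused with the mixture representation.
        self.act_proj = nn.Linear(1, hidden)
        self.blstm = nn.LSTM(hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Predict a time-frequency mask for the target speaker.
        self.mask_out = nn.Sequential(nn.Linear(2 * hidden, n_freq),
                                      nn.Sigmoid())

    def forward(self, mix_mag: torch.Tensor,
                activity: torch.Tensor) -> torch.Tensor:
        # mix_mag:  (batch, frames, n_freq) magnitude spectrogram
        # activity: (batch, frames) target-speaker activity in [0, 1]
        h = self.mix_proj(mix_mag) + self.act_proj(activity.unsqueeze(-1))
        h, _ = self.blstm(h)
        mask = self.mask_out(h)           # (batch, frames, n_freq)
        return mask * mix_mag             # masked target estimate

# Toy usage: a 4-second mixture at ~100 frames/s where the target
# speaker is active only in the second half.
net = ADEnetSketch()
mix = torch.rand(1, 400, 257)
act = torch.cat([torch.zeros(1, 200), torch.ones(1, 200)], dim=1)
target_est = net(mix, act)
print(target_est.shape)  # torch.Size([1, 400, 257])
```

In this sketch the activity sequence plays the role the abstract assigns to it: a cheap, enrollment-free speaker clue that tells the network when the target is speaking, from which it can infer who to extract during overlapped regions.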
