Paper information - Joint speaker diarization and speech recognition based on region proposal networks - 字舞流文

Joint speaker diarization and speech recognition based on region proposal networks

Shinji Watanabe | Marc Delcroix | Sanjeev Khudanpur | Zili Huang | Leibny Paola García-Perera | Desh Raj

[1] George Saon, et al. Speaker adaptation of neural network acoustic models using i-vectors, 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[2] Douglas A. Reynolds, et al. Approaches and applications of audio diarization, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[3] Nima Mesgarani, et al. TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Nima Mesgarani, et al. Speaker-Independent Speech Separation With Deep Attractor Network, 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5] Jun Du, et al. Speaker Diarization with Enhancing Speech for the First DIHARD Challenge, 2018, INTERSPEECH.

[6] Shuai Wang, et al. BUT System for the Second DIHARD Speech Diarization Challenge, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Alan McCree, et al. Speaker diarization using deep neural network embeddings, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Douglas A. Reynolds, et al. An overview of automatic speaker diarization systems, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[9] Dong Yu, et al. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks, 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10] Leibny Paola García-Perera, et al. Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection, 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Kaiming He, et al. Group Normalization, 2018, ECCV.

[13] H. Edelsbrunner, et al. Efficient algorithms for agglomerative hierarchical clustering methods, 1984.

[14] Valentin Andrei, et al. Detecting Overlapped Speech on Short Timeframes Using Deep Learning, 2017, INTERSPEECH.

[15] Jonathan Le Roux, et al. End-To-End Multi-Speaker Speech Recognition With Transformer, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Nima Mesgarani, et al. Deep attractor network for single-microphone speaker separation, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Tomohiro Nakatani, et al. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures, 2019, IEEE Journal of Selected Topics in Signal Processing.

[18] Shinji Watanabe, et al. End-to-End SpeakerBeam for Single Channel Target Speech Recognition, 2019, INTERSPEECH.

[19] Jonathan Le Roux, et al. Single-Channel Multi-Speaker Separation Using Deep Clustering, 2016, INTERSPEECH.

[20] Nicholas W. D. Evans, et al. Speaker Diarization: A Review of Recent Research, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[21] Shinji Watanabe, et al. Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge, 2018, INTERSPEECH.

[22] Jonathan Le Roux, et al. End-to-End Multi-Speaker Speech Recognition, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23] Jean Carletta, et al. The AMI Meeting Corpus: A Pre-announcement, 2005, MLMI.

[24] Zhuo Chen, et al. Continuous Speech Separation: Dataset and Analysis, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25] Marie Kunesová, et al. Detection of Overlapping Speech for the Purposes of Speaker Diarization, 2019, SPECOM.

[26] Chao Wang, et al. R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection, 2018, INTERSPEECH.

[27] Jonathan Le Roux, et al. A Purely End-to-End System for Multi-speaker Speech Recognition, 2018, ACL.

[28] Sivaji Bandyopadhyay, et al. Says Who? Deep Learning Models for Joint Speech Recognition, Segmentation and Diarization, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Mei-Yuh Hwang, et al. Region Proposal Network Based Small-Footprint Keyword Spotting, 2019, IEEE Signal Processing Letters.

[30] Hermann Ney, et al. Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech, 2019, INTERSPEECH.

[31] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32] Dong Yu, et al. Recognizing Multi-talker Speech with Permutation Invariant Training, 2017, INTERSPEECH.

[33] Gerald Friedland, et al. Overlapped speech detection for improved speaker diarization in multiparty meetings, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[34] Patrick Kenny, et al. Front-End Factor Analysis for Speaker Verification, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[35] Ngoc Thang Vu, et al. End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning, 2019, INTERSPEECH.

[36] Shinji Watanabe, et al. End-to-end Monaural Multi-speaker ASR System without Pretraining, 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37] Joon Son Chung, et al. VoxCeleb2: Deep Speaker Recognition, 2018, INTERSPEECH.

[38] Naoyuki Kanda, et al. Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers, 2020, INTERSPEECH.

[39] Sergey Novoselov, et al. Speaker Diarization with Deep Speaker Embeddings for DIHARD Challenge II, 2019, INTERSPEECH.

[40] Andreas Stolcke, et al. Observations on overlap: findings and implications for automatic processing of multi-party conversation, 2001, INTERSPEECH.

[41] Nima Mesgarani, et al. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation, 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[42] Jun Wang, et al. Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures, 2018, INTERSPEECH.

[43] Jon Barker, et al. The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines, 2018, INTERSPEECH.

[44] Kenneth Ward Church, et al. The Second DIHARD Diarization Challenge: Dataset, task, and baselines, 2019, INTERSPEECH.

[45] Shinji Watanabe, et al. Speaker Diarization with Region Proposal Network, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[46] Marijn Huijbregts, et al. The ICSI RT07s Speaker Diarization System, 2007, CLEAR.