Speech Activity Detection Based on Multilingual Speech Recognition System

To better model contextual information and improve the generalization ability of a voice detection system, this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform Speech Activity Detection (SAD). Sequence-discriminative training of the multilingual Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts contextual information from the input acoustic frames. The index of the maximum output posterior serves as a frame-level speech/non-speech decision function, and majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR system is trained on 18 languages of the BABEL datasets, and the resulting SAD system is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, it outperforms the WebRTC, phoneme-recognizer-based VAD (Phn_Rec), and Pyannote baselines by 7.1, 1.7, and 2.7% absolute, respectively, in the Detection Error Rate (DetER) metric. Similarly, on the LiveATC dataset, it outperforms the WebRTC, Phn_Rec, and Pyannote baselines by 6.4, 10.0, and 3.7% absolute, respectively, in DetER.
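The frame-level decision and fusion steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each language-dependent model emits a per-frame posterior pair [p_nonspeech, p_speech], takes the argmax as the frame decision, and fuses the aligned decision streams by simple majority voting (the logistic-regression fusion would replace the vote with a learned weighting). All function names are hypothetical.

```python
# Illustrative sketch of argmax frame decisions and majority-vote fusion
# for language-dependent speech/non-speech streams. Names are hypothetical,
# not from the paper's code.

def frame_decisions(posteriors):
    """Map per-frame posterior pairs [p_nonspeech, p_speech] to binary
    decisions via argmax (1 = speech, 0 = non-speech)."""
    return [1 if p[1] > p[0] else 0 for p in posteriors]

def majority_vote(decision_streams):
    """Fuse time-aligned binary decision streams from several
    language-dependent models: a frame is labeled speech when more
    than half of the models vote for speech."""
    n_models = len(decision_streams)
    fused = []
    for frame_votes in zip(*decision_streams):
        fused.append(1 if 2 * sum(frame_votes) > n_models else 0)
    return fused
```

For example, with three models voting [1, 0], [1, 0], [0, 1] over two frames, `majority_vote` labels the first frame speech (2 of 3 votes) and the second non-speech (1 of 3 votes).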
