Speech Activity Detection Based on Multilingual Speech Recognition System

To better model contextual information and improve the generalization ability of a voice detection system, this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform Speech Activity Detection (SAD). Sequence-discriminative training of the multilingual Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts contextual information from the input acoustic frames. The index of the maximum output posterior serves as a frame-level speech/non-speech decision function, and majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR system is trained on 18 languages of the BABEL datasets, and the resulting SAD system is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, it outperforms the WebRTC, phoneme-recognizer-based VAD (Phn_Rec), and Pyannote baselines by 7.1, 1.7, and 2.7% absolute, respectively, in the Detection Error Rate (DetER) metric. Similarly, on the LiveATC dataset, it outperforms the WebRTC, Phn_Rec, and Pyannote baselines by 6.4, 10.0, and 3.7% absolute, respectively, in DetER.
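The frame-level decision and fusion steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each language-dependent model emits a per-frame posterior pair [p_nonspeech, p_speech], takes the argmax as the frame decision, and fuses the aligned decision streams by simple majority voting (the logistic-regression fusion would replace the vote with a learned weighting). All function names are hypothetical.

```python
# Illustrative sketch of argmax frame decisions and majority-vote fusion
# for language-dependent speech/non-speech streams. Names are hypothetical,
# not from the paper's code.

def frame_decisions(posteriors):
    """Map per-frame posterior pairs [p_nonspeech, p_speech] to binary
    decisions via argmax (1 = speech, 0 = non-speech)."""
    return [1 if p[1] > p[0] else 0 for p in posteriors]

def majority_vote(decision_streams):
    """Fuse time-aligned binary decision streams from several
    language-dependent models: a frame is labeled speech when more
    than half of the models vote for speech."""
    n_models = len(decision_streams)
    fused = []
    for frame_votes in zip(*decision_streams):
        fused.append(1 if 2 * sum(frame_votes) > n_models else 0)
    return fused
```

For example, with three models voting [1, 0], [1, 0], [0, 1] over two frames, `majority_vote` labels the first frame speech (2 of 3 votes) and the second non-speech (1 of 3 votes).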
