Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams

Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD has difficulty focusing on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs for better detection of the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out information from non-target sounds. Experiments show that: 1) with cross-modal fusion guided by the proposed rule, the detection results of the A-V branch outperform those of the audio branch within the same model framework; and 2) the bimodal A-V model far outperforms audio-only models, indicating that combining audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
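To make the masking rule concrete, below is a minimal sketch of how visual representations can gate audio embeddings through an element-wise (Hadamard) product before frame-level classification. The class name, layer sizes, and feature dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RuleEmbeddedAVVAD(nn.Module):
    """Sketch of rule-embedded audio-visual fusion for frame-level VAD.

    The 'rule' illustrated here: the visual branch produces a soft gate in
    [0, 1] that is applied as an element-wise (Hadamard) mask on the audio
    embedding, suppressing frames whose visual evidence suggests the anchor
    is not speaking. Dimensions and layers are assumptions for illustration.
    """

    def __init__(self, audio_dim=64, visual_dim=512, hidden=128):
        super().__init__()
        # Audio branch: frame-level log-mel features -> temporal context via GRU.
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        # Visual branch: per-frame face/lip embeddings, time-aligned with audio.
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True, bidirectional=True)
        # Rule: sigmoid gate derived from the visual stream.
        self.gate = nn.Sequential(nn.Linear(2 * hidden, 2 * hidden), nn.Sigmoid())
        # Frame-level classifier on the masked audio embedding.
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, frames, audio_dim)
        # visual_feats: (batch, frames, visual_dim)
        a, _ = self.audio_rnn(audio_feats)
        v, _ = self.visual_rnn(visual_feats)
        mask = self.gate(v)          # soft visual mask in [0, 1]
        fused = a * mask             # Hadamard product: filter non-target sound
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # frame-wise voice probability

if __name__ == "__main__":
    model = RuleEmbeddedAVVAD()
    audio = torch.randn(2, 100, 64)    # e.g. 100 frames of 64-bin log-mel features
    video = torch.randn(2, 100, 512)   # e.g. 100 frames of face embeddings
    probs = model(audio, video)
    print(probs.shape)                 # torch.Size([2, 100])
```

The design choice shown, masking the audio stream rather than concatenating the two streams, is what lets the visual modality act as a filter on non-target sound instead of merely adding extra features.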
