Paper information - MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection

MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection

With the recent growth of remote and hybrid work, online meetings often encounter challenging audio contexts such as background noise, music, and echo. Accurate real-time detection of music events can help improve the user experience in such scenarios, e.g., by switching to a high-fidelity music-specific codec or selecting the optimal noise suppression model. In this paper, we present MusicNet, a compact high-performance model for detecting background music in the real-time communications pipeline. In online video meetings, which are our main use case, music almost always co-occurs with speech and background noise, making accurate classification quite challenging. The proposed model is a binary classifier that consists of a compact convolutional neural network core preceded by an in-model featurization layer. It takes 9 seconds of raw audio as input and does not require any model-specific featurization on the client. We train our model on a balanced subset of the AudioSet [1] data and use 1000 crowd-sourced real test clips to validate the model. Finally, we compare MusicNet performance to 20 other state-of-the-art models. Our classifier gives a true positive rate of 81.3% at a 0.1% false positive rate, which is significantly better than any other model in the study. Our model is also 10x smaller and has 4x faster inference than the comparable baseline.
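The headline metric above, true positive rate at a fixed false positive rate, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tpr_at_fpr`, the "score >= threshold means music" convention, and the toy scores are assumptions made for illustration only.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Return (threshold, tpr, fpr) with the highest TPR whose FPR <= target_fpr.

    scores: classifier outputs, higher means "more likely music" (assumed convention).
    labels: 1 for music clips, 0 for non-music clips.
    A clip counts as predicted positive when score >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative clip")

    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = (float("inf"), 0.0, 0.0)  # threshold above every score: nothing flagged
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every clip tied at this score before measuring the rates.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / negatives
        if fpr > target_fpr:
            break  # FPR only grows as the threshold drops, so stop here
        best = (threshold, tp / positives, fpr)
    return best


# Toy data (hypothetical): three music clips, three non-music clips.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
threshold, tpr, fpr = tpr_at_fpr(scores, labels, target_fpr=0.0)
print(threshold, round(tpr, 3), fpr)
```

Note that an operating point of 0.1% FPR corresponds to one false alarm per 1000 non-music clips, so the number of negative clips in the test set bounds how finely that operating point can be measured.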

[1] Yong Xu, et al. Audio Set Classification with Attention Model: A Probabilistic Perspective, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Peter Kabal, et al. Speech/music discrimination for multimedia applications, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[3] J. Stephen Downie, et al. The Music Information Retrieval Evaluation eXchange (MIREX), 2006.

[4] Kunio Kashino, et al. A background music detection method based on robust feature extraction, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5] Ross Cutler, et al. Interspeech 2021 Deep Noise Suppression Challenge, 2021, ArXiv.

[6] DeLiang Wang, et al. Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection, 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7] Xavier Serra, et al. Freesound Datasets: A Platform for the Creation of Open Audio Datasets, 2017, ISMIR.

[8] Jiancheng Lv, et al. Hierarchical Regulated Iterative Network for Joint Task of Music Detection and Music Relative Loudness Estimation, 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9] Gregory Sell, et al. Music tonality features for speech/music discrimination, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Oh-Wook Kwon, et al. Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel, 2019, EURASIP Journal on Audio, Speech, and Music Processing.

[11] Doroteo Torre Toledano, et al. Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset, 2019, EURASIP Journal on Audio, Speech, and Music Processing.

[12] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Tuomas Virtanen, et al. Detection and Classification of Acoustic Scenes and Events, 2018.

[14] Dimitrios Tzovaras, et al. Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations, 2020, Electronics.

[15] Alexis Kirke, et al. Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast, 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Sergey A. Romanov, et al. Development of an Non-Speech Audio Event Detection System, 2020, 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus).

[17] Andrey Temko, et al. CLEAR Evaluation of Acoustic Event Detection and Classification Systems, 2006, CLEAR.

[18] Tim Pohle, et al. Automatic Music Detection in Television Productions, 2007.

[19] Mark D. Plumbley, et al. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21] E. Gómez. Music and/or Speech Detection MIREX 2018 Submission, 2018.

[22] Wonyong Sung, et al. A statistical model-based voice activity detection, 1999, IEEE Signal Processing Letters.

[23] Mark D. Plumbley, et al. Weakly Labelled AudioSet Tagging With Attention Neural Networks, 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.