MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection

With the recent growth of remote and hybrid work, online meetings often encounter challenging audio conditions such as background noise, music, and echo. Accurate real-time detection of music events can improve the user experience in such scenarios, e.g., by switching to a high-fidelity music-specific codec or selecting the optimal noise suppression model. In this paper, we present MusicNet, a compact high-performance model for detecting background music in the real-time communications pipeline. In online video meetings, our main use case, music almost always co-occurs with speech and background noise, which makes accurate classification challenging. The proposed model is a binary classifier that consists of a compact convolutional neural network core preceded by an in-model featurization layer. It takes 9 seconds of raw audio as input and does not require any model-specific featurization on the client. We train our model on a balanced subset of the AudioSet [1] data and validate it on 1000 crowd-sourced real test clips. Finally, we compare MusicNet's performance to that of 20 other state-of-the-art models. Our classifier achieves a true positive rate of 81.3% at a 0.1% false positive rate, which is significantly better than any other model in the study. Our model is also 10x smaller and has 4x faster inference than the comparable baseline.
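The headline metric, true positive rate at a fixed false positive rate, can be illustrated with a short sketch. This is not the evaluation code used in the paper; the function name `tpr_at_fpr`, the strict `score > threshold` decision rule, and the toy scores below are assumptions made for illustration only.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr):
    """Return (threshold, tpr, fpr) at the loosest threshold whose FPR <= target_fpr.

    scores: classifier outputs, higher means "music" (assumed convention).
    labels: 1 for music clips, 0 for non-music clips.
    A clip is predicted positive when score > threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    neg = np.sort(scores[labels == 0])[::-1]  # negative-clip scores, descending
    pos = scores[labels == 1]
    if neg.size == 0 or pos.size == 0:
        raise ValueError("need at least one positive and one negative clip")
    # Largest number of false positives the target rate allows.
    max_fp = int(np.floor(target_fpr * neg.size))
    # Using the (max_fp + 1)-th highest negative score as the threshold means
    # at most max_fp negatives score strictly above it.
    threshold = neg[max_fp] if max_fp < neg.size else -np.inf
    tpr = float(np.mean(pos > threshold))
    fpr = float(np.mean(neg > threshold))
    return threshold, tpr, fpr

# Toy example with made-up scores (4 non-music clips, then 4 music clips).
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.9, 0.4, 0.8, 0.95, 0.25]
thr, tpr, fpr = tpr_at_fpr(toy_scores, toy_labels, target_fpr=0.25)
print(f"threshold={thr:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")
```

With N negative test clips, the achievable false positive rates are multiples of 1/N, so a very low target such as 0.1% leaves room for only a handful of false positives on a test set of this size.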

[1]  Yong Xu,et al.  Audio Set Classification with Attention Model: A Probabilistic Perspective , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Peter Kabal,et al.  Speech/music discrimination for multimedia applications , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[3]  J. Stephen Downie,et al.  The Music Information Retrieval Evaluation eXchange (MIREX) , 2006 .

[4]  Kunio Kashino,et al.  A background music detection method based on robust feature extraction , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Ross Cutler,et al.  Interspeech 2021 Deep Noise Suppression Challenge , 2021, ArXiv.

[6]  DeLiang Wang,et al.  Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Xavier Serra,et al.  Freesound Datasets: A Platform for the Creation of Open Audio Datasets , 2017, ISMIR.

[8]  Jiancheng Lv,et al.  Hierarchical Regulated Iterative Network for Joint Task of Music Detection and Music Relative Loudness Estimation , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Gregory Sell,et al.  Music tonality features for speech/music discrimination , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Oh-Wook Kwon,et al.  Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel , 2019, EURASIP Journal on Audio, Speech, and Music Processing.

[11]  Doroteo Torre Toledano,et al.  Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset , 2019, EURASIP Journal on Audio, Speech, and Music Processing.

[12]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Tuomas Virtanen,et al.  Detection and Classification of Acoustic Scenes and Events , 2018 .

[14]  Dimitrios Tzovaras,et al.  Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations , 2020, Electronics.

[15]  Alexis Kirke,et al.  Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Sergey A. Romanov,et al.  Development of an Non-Speech Audio Event Detection System , 2020, 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus).

[17]  Andrey Temko,et al.  CLEAR Evaluation of Acoustic Event Detection and Classification Systems , 2006, CLEAR.

[18]  Tim Pohle,et al.  Automatic Music Detection in Television Productions , 2007 .

[19]  Mark D. Plumbley,et al.  PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21]  E. Gómez  Music and/or Speech Detection MIREX 2018 Submission , 2018 .

[22]  Wonyong Sung,et al.  A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.

[23]  Mark D. Plumbley,et al.  Weakly Labelled AudioSet Tagging With Attention Neural Networks , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.