Hierarchical Regulated Iterative Network for Joint Task of Music Detection and Music Relative Loudness Estimation

One practical requirement of music copyright management is estimating the relative loudness of music, which most existing music detection work ignores. To address this gap, we study the joint task of music detection and music relative loudness estimation. Specifically, we observe that the joint task has two characteristics, temporality and hierarchy, that can be exploited to solve it. For example, a short audio fragment is temporally related to its neighboring fragments because they may all belong to the same event, and the event classes assigned to a fragment in the two tasks are hierarchically related. Based on these observations, we reformulate the joint task as a hierarchical event detection and localization problem. To solve it, we propose the Hierarchical Regulated Iterative Network (HRIN), which comes in two variants, HRIN-r and HRIN-cr, based on recurrent and convolutional recurrent modules, respectively. To exploit the characteristics of the joint task, our models use an iterative framework for temporal modeling and three hierarchical violation penalties to regulate the hierarchy. Extensive experiments on OpenBMAT, currently the largest dataset, show that HRIN achieves promising performance in both segment-level and event-level evaluations.
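The hierarchy between the two tasks can be illustrated with a small sketch. It assumes, for illustration only, that the detection task outputs a per-frame probability of "music" and that the loudness task outputs per-frame probabilities of "foreground music" and "background music", so that either loudness class implies "music". The function name, the class split, and the hinge form of the penalty are hypothetical; they are not the three penalties defined for HRIN, which the abstract does not specify.

```python
# Hypothetical illustration of a hierarchy violation penalty.
# The class names and the penalty form are assumptions, not the paper's definitions.

def hierarchy_violation_penalty(p_music, p_foreground, p_background):
    """Return one penalty per frame.

    The penalty is zero when neither child-class probability
    (foreground or background music) exceeds the parent-class
    probability (music), and grows linearly with the excess otherwise.
    """
    penalties = []
    for pm, pf, pb in zip(p_music, p_foreground, p_background):
        # A frame that is foreground or background music must also be music,
        # so a child probability above the parent probability is a violation.
        violation = max(0.0, pf - pm) + max(0.0, pb - pm)
        penalties.append(violation)
    return penalties


def mean_penalty(p_music, p_foreground, p_background):
    """Average the per-frame penalties into a single scalar."""
    per_frame = hierarchy_violation_penalty(p_music, p_foreground, p_background)
    return sum(per_frame) / len(per_frame) if per_frame else 0.0


if __name__ == "__main__":
    # Frame 0 is consistent; frame 1 predicts foreground music more
    # confidently than music itself, which violates the hierarchy.
    p_music = [0.9, 0.2]
    p_foreground = [0.7, 0.6]
    p_background = [0.1, 0.1]
    print(hierarchy_violation_penalty(p_music, p_foreground, p_background))
    print(mean_penalty(p_music, p_foreground, p_background))
```

In a trained model, a term of this kind would typically be added to the classification loss with a weighting coefficient, so that predictions inconsistent with the class hierarchy are discouraged rather than forbidden outright.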
