Learning Multi-instrument Classification with Partial Labels

Multi-instrument recognition is the task of predicting the presence or absence of different instruments within an audio clip. A considerable challenge in applying deep learning to this task is the scarcity of labeled data. OpenMIC is a recent dataset of 20,000 polyphonic audio clips. The dataset is weakly labeled: only the presence or absence of instruments is known for each clip, while onset and offset times are unknown. It is also partially labeled: only a subset of the instruments is annotated in each clip. In this work, we investigate the use of attention-based recurrent neural networks to address the weak-label problem, and we apply several data augmentation methods to mitigate the partial-label problem. Our experiments show that our approach achieves state-of-the-art results on the OpenMIC multi-instrument recognition task.
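As a concrete illustration of the two ideas described above, the following is a minimal PyTorch sketch, not the paper's actual implementation: a bidirectional GRU with per-class attention pooling over time (a common recipe for weakly labeled audio tagging, where only clip-level labels exist), paired with a masked binary cross-entropy loss that simply excludes unannotated instruments from the objective. All names (`AttentionRNNTagger`, `masked_bce`), dimensions, and hyperparameters here are illustrative assumptions; only the class count of 20 matches OpenMIC's instrument taxonomy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRNNTagger(nn.Module):
    """Hypothetical sketch: a BiGRU over frame-level features with
    per-class attention pooling over time, so clip-level predictions
    can be learned without onset/offset annotations."""

    def __init__(self, n_features=128, n_hidden=128, n_classes=20):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden,
                          batch_first=True, bidirectional=True)
        d = 2 * n_hidden
        self.attn = nn.Linear(d, n_classes)  # per-frame, per-class attention scores
        self.clf = nn.Linear(d, n_classes)   # per-frame class logits

    def forward(self, x):
        # x: (batch, time, n_features) frame-level features
        h, _ = self.rnn(x)                      # (B, T, 2 * n_hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights, normalized over time
        p = torch.sigmoid(self.clf(h))          # frame-level class probabilities
        return (w * p).sum(dim=1)               # clip-level probabilities, (B, C)

def masked_bce(pred, target, label_mask):
    """Binary cross-entropy over annotated classes only: instruments
    left unlabeled in a clip get mask 0 and contribute no gradient."""
    loss = F.binary_cross_entropy(pred, target, reduction="none")
    return (loss * label_mask).sum() / label_mask.sum().clamp(min=1.0)

# Usage on a toy batch of 4 clips, each with 100 frames of 128-d features:
model = AttentionRNNTagger()
pred = model(torch.randn(4, 100, 128))         # (4, 20) clip-level probabilities
target = torch.randint(0, 2, (4, 20)).float()  # observed presence/absence labels
mask = torch.randint(0, 2, (4, 20)).float()    # 1 where the class was annotated
loss = masked_bce(pred, target, mask)
```

The attention weights let the model localize each instrument in time internally while training only against clip-level targets; the mask keeps missing annotations from being treated as negatives.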
