An Attention Mechanism for Musical Instrument Recognition

While automatic recognition of musical instruments has seen significant progress, the task remains challenging for music featuring multiple instruments, as opposed to single-instrument recordings. Datasets for polyphonic instrument recognition fall roughly into two categories. Some, such as MedleyDB, provide strong per-frame instrument activity annotations but are usually small. Others, such as the larger OpenMIC, have only weak labels, i.e., instrument presence or absence is annotated only for long snippets of a song. We explore an attention mechanism for handling weakly labeled data in multi-label instrument recognition. Attention has been shown to perform well on other tasks with weakly labeled data. We compare the proposed attention model against several baselines, including a binary relevance random forest, a recurrent neural network, and fully connected neural networks. Our results show that incorporating attention leads to an overall improvement in classification accuracy metrics across all 20 instruments in the OpenMIC dataset. We find that attention enables models to focus on (or 'attend to') the time segments in the audio relevant to each instrument label, leading to interpretable results.
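At its core, the mechanism described above is attention-based pooling over weakly labeled clips: per-segment predictions are aggregated into a clip-level prediction using learned, label-specific attention weights over time. The sketch below illustrates one plausible realization in PyTorch, assuming the OpenMIC setting of ten 128-dimensional embedding frames per 10-second clip; the layer sizes, the single linear attention head, and the weighting scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Per-label attention over time segments: a minimal sketch of
    attention-based multiple-instance pooling. Dimensions follow the
    OpenMIC setup (10 embedding frames x 128 dims, 20 instruments);
    the architecture details are illustrative assumptions."""

    def __init__(self, embed_dim: int = 128, num_labels: int = 20):
        super().__init__()
        # One attention score and one classifier output per label.
        self.attention = nn.Linear(embed_dim, num_labels)   # segment relevance
        self.classifier = nn.Linear(embed_dim, num_labels)  # segment predictions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_segments, embed_dim)
        weights = torch.softmax(self.attention(x), dim=1)   # (B, T, L), sums to 1 over T
        scores = torch.sigmoid(self.classifier(x))          # (B, T, L) per-segment probs
        # Clip-level probability per label = attention-weighted
        # average of the segment-level probabilities.
        return (weights * scores).sum(dim=1)                # (B, L)

model = AttentionPooling()
clips = torch.randn(4, 10, 128)   # a batch of 4 clips of embeddings
probs = model(clips)              # (4, 20) instrument presence probabilities
```

Because the attention weights are normalized over time separately for each label, inspecting them reveals which segments the model relied on for each instrument, which is the source of the interpretability noted in the abstract.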
