Fusion of Hilbert-Huang Transform and Deep Convolutional Neural Network for Predominant Musical Instruments Recognition

As a subfield of music information retrieval (MIR), predominant musical instrument recognition (PMIR) has attracted substantial interest in recent years because of its uniqueness and high commercial value in key areas of music analysis such as music retrieval and automatic music transcription. As deep learning and artificial intelligence have drawn growing attention, they have been applied ever more widely in MIR, leading to breakthroughs in several sub-fields where progress had stalled. In this paper, the Hilbert-Huang Transform (HHT) is employed to map one-dimensional audio data into a two-dimensional matrix format, and a deep convolutional neural network is then developed to learn rich and effective features for PMIR. In total, 6705 audio pieces covering 11 musical instruments are used to validate the efficacy of the proposed approach. The results are compared with four benchmark methods and show significant improvements in precision, recall, and F1 measure.
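To make the mapping from one-dimensional audio to a two-dimensional matrix more concrete, the following is a minimal sketch of the Hilbert spectral analysis half of the HHT. It is an illustration under stated assumptions, not the paper's implementation: the empirical mode decomposition (EMD) step is omitted, two synthetic sinusoids stand in for the intrinsic mode functions (IMFs) that EMD would produce, and the function names `analytic_signal` and `hilbert_spectrum`, the bin count, and the frequency range are all hypothetical choices. The abstract does not specify how the matrix is actually laid out or fed to the network.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal of a real sequence, built in the frequency domain."""
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero out negative frequencies.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)


def hilbert_spectrum(imfs, fs, n_freq_bins, f_max):
    """Accumulate each IMF's instantaneous amplitude into a
    (frequency bin, time step) matrix. Assumes the IMFs are already given."""
    imfs = np.atleast_2d(imfs)
    n_steps = imfs.shape[1] - 1  # one fewer step because of the phase difference
    H = np.zeros((n_freq_bins, n_steps))
    t_idx = np.arange(n_steps)
    for imf in imfs:
        z = analytic_signal(imf)
        amplitude = np.abs(z)[:-1]
        phase = np.unwrap(np.angle(z))
        # Instantaneous frequency in Hz from the phase derivative.
        inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
        bins = np.floor(inst_freq / f_max * n_freq_bins).astype(int)
        valid = (bins >= 0) & (bins < n_freq_bins)
        np.add.at(H, (bins[valid], t_idx[valid]), amplitude[valid])
    return H


# Stand-in "IMFs" (an assumption for illustration): two pure tones.
fs = 1000                      # sampling rate in Hz
t = np.arange(fs) / fs         # one second of samples
imfs = np.vstack([
    np.sin(2 * np.pi * 55 * t),
    0.5 * np.sin(2 * np.pi * 125 * t),
])
H = hilbert_spectrum(imfs, fs, n_freq_bins=50, f_max=fs / 2)
print(H.shape)
```

With 10 Hz wide bins, the two tones land in separate rows of `H`, and each row carries the corresponding amplitude at every time step. A matrix of this kind could then be treated as an image-like input to a convolutional network, though the network itself is outside the scope of this sketch.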
