A Single Predominant Instrument Recognition of Polyphonic Music Using CNN-based Timbre Analysis

Classifying musical instruments in polyphonic music is a challenging but important task in music information retrieval, since it enables automatic tagging of music information such as genre. Previously, most spectrogram-based analyses have relied on the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCCs). More recently, the sparkgram has been studied and applied to audio source analysis. On the deep learning side, many modified convolutional neural network (CNN) architectures have been investigated, but their results have not improved dramatically. Instead of improving the backbone network, we focus on the preprocessing stage. In this paper, we combine a CNN with Hilbert Spectral Analysis (HSA) to address the polyphonic music problem: HSA is applied to fixed-length excerpts of polyphonic music, and the predominant instrument is labeled from its result. As a result, we achieve state-of-the-art performance on the IRMAS dataset and a 3% improvement on individual instruments.
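
To make the preprocessing step concrete, the following is a minimal Python sketch of Hilbert Spectral Analysis applied to a fixed-length excerpt, producing a time-frequency image that a CNN could take as input. This is not the authors' exact pipeline: the PyEMD and librosa packages, the 3-second excerpt length, the sample rate, and the bin counts are all illustrative assumptions.

# Minimal HSA sketch (illustrative, not the paper's exact parameters):
# EMD splits the excerpt into intrinsic mode functions, and the Hilbert
# transform of each IMF yields instantaneous amplitude and frequency,
# which are accumulated into a coarse Hilbert spectrum image.
import numpy as np
import librosa                      # assumed for audio loading
from scipy.signal import hilbert
from PyEMD import EMD               # assumed EMD implementation (pip install EMD-signal)

def hilbert_spectrum(path, sr=22050, duration=3.0, n_freq_bins=128, n_time_bins=128):
    # Load a fixed-length mono excerpt of the polyphonic recording.
    y, sr = librosa.load(path, sr=sr, duration=duration, mono=True)

    # Empirical Mode Decomposition into intrinsic mode functions (IMFs).
    imfs = EMD()(y)

    spectrum = np.zeros((n_freq_bins, n_time_bins))
    t_idx = np.linspace(0, len(y) - 1, n_time_bins).astype(int)

    for imf in imfs:
        # Analytic signal gives instantaneous amplitude and phase.
        analytic = hilbert(imf)
        amp = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * sr / (2.0 * np.pi)   # instantaneous frequency in Hz
        inst_freq = np.clip(inst_freq, 0, sr / 2)

        # Map each sampled time point into a frequency-by-time grid (the Hilbert spectrum).
        f_idx = (inst_freq / (sr / 2) * (n_freq_bins - 1)).astype(int)
        for ti, t in enumerate(t_idx):
            t = min(t, inst_freq.shape[0] - 1)  # inst_freq is one sample shorter than amp
            spectrum[f_idx[t], ti] += amp[t]

    # Log compression; the resulting image is the CNN input.
    return np.log1p(spectrum)

In the setting described above, each such image, computed from a labeled excerpt, would be paired with its predominant-instrument label and used to train the CNN classifier.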
