Influence of input data representations for time-dependent instrument recognition

Abstract An important preprocessing step for many music signal processing algorithms is the identification of the instruments playing in a music recording. To this end, time-dependent instrument recognition is performed in this approach with a neural network containing residual blocks. Since music signal processing tasks use diverse time-frequency representations as input matrices, the influence of different input representations on instrument recognition is analyzed in this work. Two-dimensional inputs of short-time Fourier transform (STFT) or constant-Q transform (CQT) magnitudes are investigated, as well as three-dimensional inputs that combine STFT magnitudes with an additional time-frequency representation based on phase information. As phase-based representations, the product spectrum (PS), derived from the modified group delay, and the frequency error (FE) matrix, related to the instantaneous frequency, are used. Training and evaluation are carried out on the MusicNet dataset, which allows seven instruments to be recognized. With a higher number of frequency bins in the input representations, instrument recognition improves by about 2 % in F1-score. Compared to the literature, frame-level instrument recognition is improved for the different input representations.
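The following minimal Python sketch illustrates how such magnitude and phase-based input representations could be computed, assuming librosa and typical analysis parameters. The frequency-error matrix shown here is an illustrative deviation of the instantaneous frequency from the bin centre frequencies, not necessarily the paper's exact definition, and the file name example.wav is a placeholder.

```python
import numpy as np
import librosa

# Assumed parameters; the paper's exact settings are not given in the abstract.
y, sr = librosa.load("example.wav", sr=22050, mono=True)
n_fft, hop = 2048, 512

# Two-dimensional magnitude inputs
stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
stft_mag = np.abs(stft)                                   # STFT magnitude
cqt_mag = np.abs(librosa.cqt(y, sr=sr, hop_length=hop))   # CQT magnitude

# Phase-based representation: frequency-error-style matrix from the
# instantaneous frequency (illustrative definition only).
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)       # bin centre frequencies in Hz
phase = np.angle(stft)
inst_freq = np.diff(np.unwrap(phase, axis=1), axis=1) * sr / (2 * np.pi * hop)
freq_error = inst_freq - freqs[:, None]                    # deviation from bin centre

# Three-dimensional input: magnitude and phase representation stacked as channels
three_d_input = np.stack([stft_mag[:, 1:], freq_error], axis=-1)
print(three_d_input.shape)                                 # (freq_bins, frames, 2)
```

Such stacked inputs give the network both magnitude and phase-derived information per time-frequency bin, while the two-dimensional STFT or CQT magnitudes serve as single-channel baselines.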
