Attention Augmented CNNs for Musical Instrument Identification

We study the effectiveness of attention-augmented convolutional neural networks for musical instrument identification in audio, a problem that remains unsolved. Attention augmentation has not previously been applied to this task. The proposed architecture augments the final convolutional modules of a baseline convolutional template with attention mechanisms. The network contains five convolutional modules in total, followed by five dense layers; the final layer uses a softmax output to classify 19 orchestral instruments. Attention is introduced to enhance the network's ability to extract the causal structure underlying the formation of the spectrograms. We vary the ratio of attention to convolution filters in order to assess the efficacy of adding attention for this particular task. The input to the network is a 2D spectrogram of a 1 s audio excerpt taken from the London Philharmonic Orchestra and the University of Iowa Musical Instrument datasets. Experiments use two spectrogram types, CQT and STFT, to assess their relative merits. Results show that networks with 25% of their filters allocated to attention outperform their convolution-only counterparts, achieving 95.09% and 92.40% overall accuracy for STFT and CQT input spectrograms, respectively; the convolution-only models achieve 84.94% and 91.43% accuracy, respectively.
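To illustrate the core idea, the following is a minimal PyTorch sketch of an attention-augmented 2D convolution in which a fraction of the output channels (here 25%, mirroring the best-performing ratio above) comes from self-attention over spatial positions rather than from convolution. This is a hypothetical simplification for exposition (single-head attention, no relative positional encodings), not the paper's exact implementation; the class name, channel split, and layer choices are assumptions.

```python
import torch
import torch.nn as nn


class AttentionAugmentedConv2d(nn.Module):
    """Conv layer whose output channels are split between a standard
    convolution and single-head self-attention over spatial positions.
    `attn_ratio=0.25` corresponds to the 25% attention share discussed
    in the abstract (sketch only, not the authors' exact module)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, attn_ratio=0.25):
        super().__init__()
        self.attn_ch = int(out_ch * attn_ratio)      # channels from attention
        conv_ch = out_ch - self.attn_ch              # channels from convolution
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size,
                              padding=kernel_size // 2)
        # 1x1 convolution producing queries, keys, and values jointly
        self.to_qkv = nn.Conv2d(in_ch, 3 * self.attn_ch, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)                               # (b, conv_ch, h, w)
        q, k, v = self.to_qkv(x).chunk(3, dim=1)              # (b, attn_ch, h, w) each
        # Flatten the spatial grid so every time-frequency bin attends
        # to every other bin in the spectrogram.
        q = q.flatten(2).transpose(1, 2)                      # (b, h*w, attn_ch)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        scores = q @ k.transpose(1, 2) / self.attn_ch ** 0.5  # scaled dot-product
        attn_out = torch.softmax(scores, dim=-1) @ v          # (b, h*w, attn_ch)
        attn_out = attn_out.transpose(1, 2).reshape(b, self.attn_ch, h, w)
        # Concatenate convolutional and attentional feature maps.
        return torch.cat([conv_out, attn_out], dim=1)


if __name__ == "__main__":
    layer = AttentionAugmentedConv2d(in_ch=1, out_ch=64)
    spec = torch.randn(2, 1, 16, 16)  # batch of toy single-channel spectrograms
    out = layer(spec)
    print(out.shape)  # torch.Size([2, 64, 16, 16])
```

In a full model of the kind described above, blocks like this would replace the final convolutional modules, with the remaining layers and the five-dense-layer softmax head left as ordinary convolution and fully connected layers.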