Multi-modal, Multi-task, Multi-label Learning for Music Genre Classification and Emotion Regression

A smart system capable of dividing music into coarse and fine categories based on emotion and genre is highly desirable. In this paper, we classify music by genre and emotion into 44 coarse categories, which are further subdivided into 255 fine categories. Audio and lyrics are fed into two separate networks, and their global information is integrated for the final classification and regression tasks. We propose a channel and filter convolution network that factorizes the spatial and temporal interactions of standard 2D/3D convolutions. Furthermore, the channel interaction of the standard residual block is factorized to one, so each filter interacts with only a single channel. The proposed convolution yields significant gains in accuracy at lower computational cost. The network is trained and tested on a public dataset and evaluated with the audio and lyrics networks both individually and jointly.
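The abstract does not spell out the factorized block, but reducing the channel interaction of a residual block to one while separating spatial/temporal filtering matches a depthwise-separable design. The PyTorch sketch below is a minimal illustration under that assumption; `FactorizedResBlock` and all hyperparameters are hypothetical and not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class FactorizedResBlock(nn.Module):
    """Residual block whose standard 2D convolution is factorized into a
    per-channel (depthwise) spatial convolution followed by a 1x1 pointwise
    convolution, so each spatial filter interacts with exactly one channel.
    This is an illustrative sketch, not the paper's exact block."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise conv: groups == channels, i.e. channel interaction of one.
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=pad, groups=channels, bias=False)
        # Pointwise 1x1 conv restores cross-channel mixing at low cost.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.bn(self.pointwise(self.spatial(x))))
        return out + x  # residual connection


if __name__ == "__main__":
    # e.g. a batch of log-mel spectrogram feature maps: (batch, C, freq, time)
    x = torch.randn(8, 64, 96, 128)
    block = FactorizedResBlock(channels=64)
    print(block(x).shape)  # torch.Size([8, 64, 96, 128])
```

Compared with a dense 3x3 convolution, which needs C*C*9 multiplications per output position, the depthwise + pointwise pair needs only C*9 + C*C, which is the source of the claimed reduction in computational cost.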