Neural Network architectures to classify emotions in Indian Classical Music

Music is often considered the language of emotions. It has long been known to elicit emotions in human beings, and categorizing music by the emotions it induces is therefore an intriguing topic of research. The task becomes far more challenging for Indian Classical Music (ICM) because of the inherent ambiguity associated with it: the fact that a single musical performance can evoke a variety of emotional responses in the audience is implicit in the nature of ICM renditions. With rapid advances in the field of deep learning, Music Emotion Recognition (MER) is becoming increasingly relevant and robust, and can hence be applied to one of its most challenging test cases, namely classifying the emotions elicited by ICM. In this paper we present a new dataset, JUMusEmoDB, which presently contains 400 audio clips (30 seconds each), of which 200 clips correspond to the happy emotion and the remaining 200 to the sad emotion. The initial annotation and emotional classification of the database were based on an emotional rating test (5-point Likert scale) performed by 100 participants. The clips were taken from different conventional 'raga' renditions played on the sitar by an eminent maestro of ICM and digitized at a 44.1 kHz sampling rate. The ragas, which are unique to ICM, are musical structures capable of inducing different moods or emotions. For supervised classification, we used four existing deep Convolutional Neural Network (CNN) architectures (ResNet-18, MobileNetV2, SqueezeNet v1.0 and VGG16) on the spectrograms of the 2000 sub-clips (every clip was segmented into 5 sub-clips of about 5 seconds each), which contain both time- as well as frequency-domain information. The initial results are quite encouraging, and we look forward to setting baseline values for the dataset using these architectures. A CNN-based classification study using such a rich corpus of Indian Classical Music is unique even from a global perspective and can be replicated for other modalities of music as well. The dataset is still under development; we plan to include more data covering other emotional categories, and to make the dataset publicly available soon.
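
As a concrete illustration of the preprocessing described above, the sketch below segments one clip into five sub-clips and converts each into a mel-spectrogram image. It assumes librosa and matplotlib are available; the mel parameters, figure size, and non-overlapping segmentation are illustrative choices, since the paper does not specify its exact spectrogram configuration.

```python
# Minimal preprocessing sketch (assumptions: librosa + matplotlib;
# mel/image settings are illustrative, not the paper's exact values).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

SR = 44100        # sampling rate stated in the paper
N_SEGMENTS = 5    # each 30 s clip yields 5 sub-clips

def clip_to_spectrograms(path, out_prefix):
    y, sr = librosa.load(path, sr=SR)
    seg_len = len(y) // N_SEGMENTS
    for i in range(N_SEGMENTS):
        seg = y[i * seg_len:(i + 1) * seg_len]
        # Mel spectrogram retains both time- and frequency-domain information.
        S = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)
        fig, ax = plt.subplots(figsize=(3, 3))
        librosa.display.specshow(S_db, sr=sr, ax=ax)
        ax.set_axis_off()  # save the raw image, without axes or colorbar
        fig.savefig(f"{out_prefix}_{i}.png", bbox_inches="tight", pad_inches=0)
        plt.close(fig)
```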
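
The supervised classification step can likewise be sketched as standard transfer learning: load one of the four pretrained CNNs and replace its classification head with a two-way (happy/sad) output. The snippet below uses torchvision's ResNet-18 as the example; the loss, optimizer, and learning rate are illustrative assumptions, not the paper's reported settings.

```python
# Transfer-learning sketch for the binary happy/sad task
# (assumptions: torchvision model zoo; hyper-parameters are illustrative).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: happy, sad

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """images: (B, 3, 224, 224) spectrogram tensors; labels: (B,) in {0, 1}."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the other three networks the classification head lives in `model.classifier` rather than `model.fc` (e.g. `model.classifier[6]` for VGG16, `model.classifier[1]` for MobileNetV2, and the final 1x1 convolution `model.classifier[1]` for SqueezeNet v1.0), but the fine-tuning loop is otherwise identical.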
