Multimodal Fusion of EEG and Musical Features in Music-Emotion Recognition

Multimodality has recently been exploited to overcome the challenges of emotion recognition. In this paper, we present a study of decision-level fusion of electroencephalogram (EEG) features and musical features extracted from musical stimuli for recognizing time-varying binary classes of arousal and valence. Our empirical results demonstrate that the EEG modality suffered from the instability of EEG signals, yet fusing it with the music modality alleviated the issue and enhanced the performance of emotion recognition.

Electroencephalogram (EEG), a tool to capture brainwaves, has recently been used to estimate human emotional states but confronts a variety of challenges. Recent efforts to reinforce emotion recognition models include using EEG features in conjunction with other information sources (D'mello and Kory 2015), such as facial expressions and peripheral signals. One possible solution is to exploit information regarding the felt emotion in conjunction with the emotion expressed in music to estimate emotional states. In this paper, we propose a methodology to fuse dynamic information from physiological signals and musical content at the decision level (or late integration), based on the assumption that both modalities can play complementary roles in a music-emotion recognition model. We found that the performance of continuously estimating emotional responses during music listening using both modalities outperformed that of the EEG modality alone.

Research Methodology

Experimental Protocol

Twelve healthy male volunteers (average age = 25.59 y, SD = 1.69 y) were recruited to participate in our experiment. Each subject was instructed to listen to 16 self-selected MIDI songs. Simultaneously, EEG signals were acquired from 12 electrodes of a Waveguard EEG cap placed in accordance with the international 10-20 system. The selected electrodes were positioned near the frontal lobe. Throughout EEG recording, the Cz electrode was used as a reference and the impedance of each electrode was kept below 20 kΩ. EEG signals were recorded at a 250 Hz sampling rate, and a 0.5-60 Hz bandpass filter was applied. Each subject was also asked to keep his eyes closed and to minimize body movement during EEG recording to reduce the effects of unrelated artifacts.

After music listening, each subject was instructed to annotate the emotions he had felt in the previous session by continuously clicking, with a mouse, at the corresponding point in the arousal-valence emotion space (Russell 1980), a continuous space widely used to describe emotions, shown on a monitor screen. Arousal describes emotional intensity, ranging from calm (-1) to activated (+1), whereas valence describes the positivity of emotion, ranging from unpleasant (-1) to pleasant (+1).

EEG and Musical Features

To extract features from EEG signals, we applied the fractal dimension (FD) approach. FD is a non-negative real value that quantifies the complexity and irregularity of data and can be used to reveal the complexity of a time-varying EEG signal. We applied the Higuchi algorithm (Higuchi 1988) to derive one FD feature per electrode from each particular window, yielding 12 features named in accordance with the electrode names. Based on a previous study (Thammasan et al. 2016), five asymmetry indexes, computed as differences between the FD features of symmetric electrode pairs, were also added to our original feature set.
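To make the per-window FD computation concrete, the following is a minimal Python sketch of the Higuchi algorithm applied to band-pass-filtered, windowed EEG. The window length, k_max, filter order, and function names are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def bandpass(x, fs, low=0.5, high=60.0, order=4):
    """Zero-phase Butterworth band-pass filter (0.5-60 Hz, as in the recording setup)."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x)


def higuchi_fd(x, k_max=8):
    """Higuchi (1988) fractal dimension of a 1-D signal; k_max is an assumed parameter."""
    x = np.asarray(x, dtype=float)
    n = x.size
    lengths = []
    for k in range(1, k_max + 1):
        lk = []
        for m in range(k):
            idx = np.arange(m, n, k)                # sub-sampled series x[m], x[m+k], ...
            if idx.size < 2:
                continue
            diff = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / ((idx.size - 1) * k)   # Higuchi normalization factor
            lk.append(diff * norm / k)
        lengths.append(np.mean(lk))
    # FD is the slope of log L(k) against log(1/k)
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, k_max + 1)), np.log(lengths), 1)
    return slope


def fd_features(eeg, fs=250, win_sec=4.0):
    """Per-electrode FD features over consecutive windows.

    eeg: array of shape (n_channels, n_samples); the window length is an
    illustrative assumption. Asymmetry indexes can then be formed as
    differences between columns of symmetric electrode pairs.
    """
    win = int(win_sec * fs)
    filtered = np.vstack([bandpass(ch, fs) for ch in eeg])
    n_wins = filtered.shape[1] // win
    feats = np.empty((n_wins, filtered.shape[0]))
    for w in range(n_wins):
        seg = filtered[:, w * win:(w + 1) * win]
        feats[w] = [higuchi_fd(ch) for ch in seg]
    return feats
```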
To extract musical features from the MIDI songs, we employed the MIRtoolbox (Lartillot and Toiviainen 2007). A dynamics feature of each song was derived from the frame-based root mean square of the amplitude. Rhythm is the pattern of pulses/notes of varying strength; we extracted the frame-based tempo estimation and the attack times and slopes of the onsets from the songs. Timbre reflects the spectro-temporal characteristics of sound; we extracted the spectral roughness, which measures the noisiness of the spectrum, along with 13 Mel-frequency cepstral coefficients and their derivatives up to the first order. In addition, we extracted the frame-decomposed zero-crossing rate, the low-energy rate, and the frame-decomposed spectral flux from the songs. To extract tonal characteristics, we calculated the frame-decomposed key clarity, mode, and harmonic change detection function from the songs. Afterward, we calculated the mean of each feature within each window, yielding 37 musical features in total.
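MIRtoolbox is a MATLAB toolbox, so the paper's exact extraction pipeline is not reproduced here; as a rough illustration of the frame-then-window scheme, the sketch below computes a comparable subset of descriptors (RMS dynamics, 13 MFCCs with first-order deltas, zero-crossing rate, and a spectral-flux-like onset strength) with librosa on audio rendered from the MIDI files, then takes per-window means. The file path, hop size, and window length are illustrative assumptions, and the remaining features (roughness, tempo, attacks, key clarity, mode, HCDF) are omitted.

```python
import numpy as np
import librosa


def music_features(audio_path, sr=22050, hop_length=512, win_sec=4.0):
    """Window-averaged musical descriptors from a rendered audio file (librosa analogue)."""
    y, sr = librosa.load(audio_path, sr=sr)

    rms = librosa.feature.rms(y=y, hop_length=hop_length)                       # dynamics
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)   # timbre
    d_mfcc = librosa.feature.delta(mfcc, order=1)                               # 1st-order derivatives
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop_length)
    flux = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)[None, :]

    # Stack frame-level features, trimming to a common number of frames.
    feats = [rms, mfcc, d_mfcc, zcr, flux]
    n_frames = min(f.shape[1] for f in feats)
    frames = np.vstack([f[:, :n_frames] for f in feats])     # (n_features, n_frames)

    # Average frame values inside consecutive analysis windows, mirroring
    # the per-window means described in the paper.
    frames_per_win = int(win_sec * sr / hop_length)
    n_wins = frames.shape[1] // frames_per_win
    means = [frames[:, w * frames_per_win:(w + 1) * frames_per_win].mean(axis=1)
             for w in range(n_wins)]
    return np.array(means)                                    # (n_windows, n_features)
```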