MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning

Over the past few decades, multimodal emotion recognition has made remarkable progress with the development of deep learning. However, existing technologies still fall short of the demands of practical applications. To improve robustness, we launch the Multimodal Emotion Recognition Challenge (MER 2023) to motivate researchers worldwide to build innovative technologies and to further accelerate and foster research in this field. This year's challenge comprises three distinct sub-challenges: (1) MER-MULTI, in which participants recognize both discrete and dimensional emotions; (2) MER-NOISE, in which noise is added to the test videos to evaluate modality robustness; and (3) MER-SEMI, which provides a large amount of unlabeled samples for semi-supervised learning. In this paper, we test a variety of multimodal features and provide a competitive baseline for each sub-challenge. Our system achieves an F1 score of 77.57% and a mean squared error (MSE) of 0.82 on MER-MULTI, an F1 score of 69.82% and an MSE of 1.12 on MER-NOISE, and an F1 score of 86.75% on MER-SEMI. The baseline code is available at https://github.com/zeroQiaoba/MER2023-Baseline.
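To make the reported numbers concrete, here is a minimal sketch of how the two MER-MULTI metrics (F1 over discrete emotions, MSE over dimensional scores) could be computed with scikit-learn. The weighted-average F1 setting, the toy labels, and the array names (discrete_true, discrete_pred, valence_true, valence_pred) are our assumptions for illustration, not details taken from the official evaluation protocol.

```python
# Sketch of the MER-MULTI metric computation (assumption: weighted-average
# F1 over discrete emotion labels plus MSE over continuous valence scores).
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Hypothetical predictions: discrete emotion categories and valence values.
discrete_true = np.array(["happy", "sad", "angry", "neutral"])
discrete_pred = np.array(["happy", "neutral", "angry", "neutral"])
valence_true = np.array([0.8, -0.6, -0.9, 0.1])
valence_pred = np.array([0.7, -0.2, -0.8, 0.0])

# F1 score over the discrete labels, weighted by class support.
f1 = f1_score(discrete_true, discrete_pred, average="weighted")

# Mean squared error over the dimensional (valence) predictions.
mse = mean_squared_error(valence_true, valence_pred)

print(f"F1: {f1:.4f}, MSE: {mse:.4f}")
```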
