Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address these challenges, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show that Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.
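As a rough illustration of the fusion described above, and not the paper's actual implementation, the phrase "aligning features into a shared space" can be read in the common projection-and-prepend style: each modality's encoder output is mapped by a learned projector into the LLaMA token-embedding space, and the resulting soft tokens are placed ahead of the text prompt. The module names, encoder choices, dimensions, and token counts in the sketch below are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MultimodalEmotionFusion(nn.Module):
    """Illustrative fusion: project audio/visual features into the language
    model's embedding space and prepend them to the text token embeddings."""

    def __init__(self, audio_dim=1024, visual_dim=768, llm_dim=4096):
        super().__init__()
        # One linear projector per modality, mapping encoder features
        # into the shared LLM token-embedding space (dimensions are illustrative).
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats:  (B, T_a, audio_dim)  e.g. pooled features from a speech encoder
        # visual_feats: (B, T_v, visual_dim) e.g. features from a facial/video encoder
        # text_embeds:  (B, T_t, llm_dim)    token embeddings of the instruction prompt
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Concatenate the multimodal "soft tokens" ahead of the text prompt;
        # a decoder-only LLM can then attend over all of them jointly.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)

# Shape check with random tensors; real encoders and token counts differ.
fusion = MultimodalEmotionFusion()
fused = fusion(torch.randn(2, 8, 1024), torch.randn(2, 16, 768), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 56, 4096])
```

Instruction tuning then trains the projectors (and, depending on the setup, parts of the LLM) on prompt-response pairs such as those in MERR, so that the model grounds its emotional reasoning in the injected audio and visual tokens.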

[1] Zhi-Qi Cheng, et al. MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis, 2024, ArXiv.

[2] Hong-Han Shuai, et al. EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning, 2024, ArXiv.

[3] Zhi-Qi Cheng, et al. MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models, 2024, SEMEVAL.

[4] Chandni Saxena, et al. JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models, 2024, SEMEVAL.

[5] Licai Sun, et al. GPT-4V with emotion: A zero-shot benchmark for Generalized Emotion Recognition, 2023, Inf. Fusion.

[6] Peng Jin, et al. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, 2023, ArXiv.

[7] Xiaohuan Zhou, et al. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models, 2023, ArXiv.

[8] Dinghao Zhou, et al. Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis, 2023, MRAC@MM.

[9] Haifeng Chen, et al. Semi-Supervised Multimodal Emotion Recognition with Class-Balanced Pseudo-labeling, 2023, ACM Multimedia.

[10] Shuyi Mao, et al. Semi-Supervised Multimodal Emotion Recognition with Expression MAE, 2023, ACM Multimedia.

[11] Raghuraman Krishnamoorthi, et al. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning, 2023, ArXiv.

[12] Guanting Dong, et al. InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-task LLMs Framework, 2023, ArXiv.

[13] Wenhu Chen, et al. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning, 2023, ArXiv.

[14] Eric Michael Smith, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023, ArXiv.

[15] D. Cohen-Or, et al. EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[16] B. Liu, et al. MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition, 2023, ACM Multimedia.

[17] Feng Zhu, et al. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic, 2023, ArXiv.

[18] Li Dong, et al. Kosmos-2: Grounding Multimodal Large Language Models to the World, 2023, ArXiv.

[19] K. Lim, et al. PAtt-Lite: Lightweight Patch and Attention MobileNet for Challenging Facial Expression Recognition, 2023, IEEE Access.

[20] Zhongyu Wei, et al. Valley: Video Assistant with Large Language model Enhanced abilitY, 2023, ArXiv.

[21] Salman Khan, et al. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, 2023, ArXiv.

[22] Lidong Bing, et al. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, 2023, EMNLP.

[23] Yan Wang, et al. PandaGPT: One Model To Instruction-Follow Them All, 2023, TLLM.

[24] J. Z. Wang, et al. Learning Emotion Representations from Verbal and Nonverbal Communication, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Jiannan Wu, et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, 2023, NeurIPS.

[26] Yi Wang, et al. VideoChat: Chat-Centric Video Understanding, 2023, ArXiv.

[27] Kalyan Vasudev Alwala, et al. ImageBind: One Embedding Space to Bind Them All, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Yu-Gang Jiang, et al. Implicit Temporal Modeling with Learnable Alignment for Video Recognition, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[29] Björn Schuller, et al. MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning, 2023, ACM Multimedia.

[30] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, NeurIPS.

[31] Hongsheng Li, et al. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, 2023, ArXiv.

[32] Yuan-Zheng Wang, et al. Decoupled Multimodal Distilling for Emotion Recognition, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Mehdi S. M. Sajjadi, et al. PaLM-E: An Embodied Multimodal Language Model, 2023, ICML.

[34] Li Dong, et al. Language Is Not All You Need: Aligning Perception with Language Models, 2023, NeurIPS.

[35] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ICML.

[36] Suraya Alias, et al. Beyond Sentiment Analysis: A Review of Recent Trends in Text Based Sentiment Analysis and Emotion Detection, 2023, J. Adv. Comput. Intell. Intell. Informatics.

[37] V. Kondratenko, et al. Large Raw Emotional Dataset with Aggregation Mechanism, 2022, ArXiv (2212.12266).

[38] Xi Victoria Lin, et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization, 2022, ArXiv.

[39] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, ACL.

[40] Ledell Yu Wu, et al. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models, 2022, ArXiv.

[42] Feng Zhao, et al. Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild, 2022, AAAI.

[43] A. Hauptmann, et al. GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, 2022, ACM Multimedia.

[44] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, ArXiv.

[45] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[46] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[47] Limin Wang, et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022, NeurIPS.

[48] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.

[49] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Qingshan Liu, et al. Former-DFER: Dynamic Facial Expression Recognition Transformer, 2021, ACM Multimedia.

[51] Ruslan Salakhutdinov, et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[52] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[53] Xiangmin Xu, et al. LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition, 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[54] Wenming Zheng, et al. DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild, 2020, ACM Multimedia.

[55] Tom B. Brown, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[56] Jianfei Yang, et al. Suppressing Uncertainties for Large-Scale Facial Expression Recognition, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Maryam Imani, et al. A survey of emotion recognition methods with emphasis on E-Learning environments, 2019, J. Netw. Comput. Appl.

[58] Peter J. Liu, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[59] Jun Du, et al. Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition, 2019, ICMI.

[60] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[61] Erik Cambria, et al. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, 2018, ACL.

[62] Yang Liu, et al. Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Yang Liu, et al. Video eCommerce++: Toward Large Scale Online Video Advertising, 2017, IEEE Transactions on Multimedia.

[64] Lawrence H. Gerstein, et al. Emotion Recognition, Emotion Expression, and Cultural Display Rules: Implications for Counseling, 2017.

[65] Yang Liu, et al. Video eCommerce: Towards Online Video Advertising, 2016, ACM Multimedia.

[66] Alexandra Birch, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[67] H. Ip, et al. Human Computer Interaction, 2015, Lecture Notes in Computer Science.

[68] I. Mackenzie. Human-Computer Interaction: An Empirical Research Perspective, 2012.

[69] Jing Yu Koh, et al. Grounding Language Models to Images for Multimodal Generation, 2023, ArXiv.

[70] Yuanzhi Wang, et al. Incomplete Multimodality-Diffused Emotion Recognition, 2023, NeurIPS.

[71] Junyang Lin, et al. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, 2023, ArXiv.

[72] Noah A. Smith, et al. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks, 2022, ArXiv.

[73] Z. Wan, et al. Psychological Counseling and Character Analysis Algorithm Based on Image Emotion, 2020, IEEE Access.

[74] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.