Paper Information - Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

Recently, the community has witnessed the advancement of Large Language Models (LLMs), which have shown remarkable performance on various downstream tasks. Led by powerful models like ChatGPT and Claude, LLMs are revolutionizing how users engage with software, serving not as mere tools but as intelligent assistants. Consequently, evaluating LLMs' anthropomorphic capabilities is becoming increasingly important. Drawing on emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation covers five LLMs, both commercial and open-source, spanning a range of model sizes and including recent iterations such as GPT-4 and LLaMA 2. The results indicate that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short of aligning with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed EmotionBench, are publicly available at https://github.com/CUHK-ARISE/EmotionBench. We aspire to contribute to the advancement of LLMs toward better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.
