Heterogeneous Value Evaluation for Large Language Models

The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans. Current methodologies typically attempt alignment with a single, homogeneous human value and require human verification, yet they lack consensus on which aspects of value and what depth of alignment are desired, and they inherit the biases of the humans involved. In this paper, we propose A2EHV, an Automated Alignment Evaluation with a Heterogeneous Value system that (1) is automated to minimize individual human biases, and (2) allows assessment against a variety of target values so as to foster heterogeneous agents. Our approach pivots on the concept of value rationality, an agent's ability to choose the behaviors that best satisfy a given target value. The quantification of value rationality is facilitated by the Social Value Orientation (SVO) framework from social psychology, which partitions the value space into four categories and assesses social preferences from agents' behaviors. We evaluate the value rationality of eight mainstream LLMs and observe that large models align more readily with neutral target values than with strongly personal ones. By examining the behavior of these LLMs, we contribute to a deeper understanding of value alignment within a heterogeneous value system.
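As a concrete illustration of how the SVO framework partitions the value space, the sketch below computes the standard SVO angle from an agent's payoff allocations to itself and to another party, then maps the angle to one of the four categories (altruistic, prosocial, individualistic, competitive). The function names and the toy allocations are ours for illustration; the angle formula and category boundaries follow the SVO Slider Measure of Murphy et al. (2011), not necessarily the exact evaluation procedure used in A2EHV.

```python
import math

# Category boundaries (degrees) from the SVO Slider Measure
# (Murphy, Ackermann & Handgraaf, 2011).
SVO_CATEGORIES = [
    ("altruistic", 57.15, float("inf")),
    ("prosocial", 22.45, 57.15),
    ("individualistic", -12.04, 22.45),
    ("competitive", float("-inf"), -12.04),
]


def svo_angle(self_allocations, other_allocations):
    """Compute the SVO angle (in degrees) from payoff allocations.

    Each list holds the payoffs an agent allocated to itself and to the
    other party across the slider items; 50 is the midpoint of the
    allocation scale in the original instrument.
    """
    mean_self = sum(self_allocations) / len(self_allocations)
    mean_other = sum(other_allocations) / len(other_allocations)
    return math.degrees(math.atan2(mean_other - 50.0, mean_self - 50.0))


def svo_category(angle):
    """Map an SVO angle to one of the four social-preference categories."""
    for name, lower, upper in SVO_CATEGORIES:
        if lower < angle <= upper:
            return name
    return "competitive"


# Toy example: an agent that allocates payoffs equally to itself and the
# other party lands at 45 degrees, i.e. in the prosocial category.
print(svo_category(svo_angle([85, 85, 85], [85, 85, 85])))  # prosocial
```

Categorizing an agent's revealed allocations in this way gives a behavioral, rather than self-reported, estimate of its social preference, which is what makes the framework suitable for scoring LLM behaviors against heterogeneous target values.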
