TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models

Large Language Models (LLMs) such as ChatGPT have gained significant attention due to their impressive natural language processing capabilities. It is crucial to prioritize human-centered principles when utilizing these models, and safeguarding the ethical and moral compliance of LLMs is of utmost importance. However, individual ethical issues have not been well studied in the latest LLMs. This study addresses these gaps by introducing a new benchmark -- TrustGPT. TrustGPT provides a comprehensive evaluation of LLMs in three crucial areas: toxicity, bias, and value-alignment. First, TrustGPT examines toxicity in language models using toxic prompt templates derived from social norms. It then quantifies bias by comparing the toxicity of model outputs across different demographic groups. Finally, TrustGPT assesses value-alignment in conversation generation models through both active and passive value-alignment tasks. Through TrustGPT, this research aims to deepen our understanding of the performance of conversation generation models and to promote the development of language models that are more ethical and socially responsible.
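The group-wise bias measurement lends itself to a simple computational recipe: score each model response with a toxicity classifier, then compare the score distributions across demographic groups. The sketch below illustrates one such pipeline in Python; the `score_toxicity` wrapper, the grouping of responses, and the use of a Mann-Whitney U test for pairwise comparison are illustrative assumptions, not a description of TrustGPT's exact implementation.

```python
# Illustrative sketch of group-wise bias measurement (not TrustGPT's exact code).
# score_toxicity() is a hypothetical wrapper around any toxicity classifier
# (e.g., Perspective API); plug in a real scorer before use.
from itertools import combinations
from statistics import mean, stdev
from scipy.stats import mannwhitneyu

def score_toxicity(text: str) -> float:
    """Return a toxicity score in [0, 1] for one model response (placeholder)."""
    raise NotImplementedError("attach a real toxicity classifier here")

def bias_report(responses_by_group: dict[str, list[str]]) -> dict:
    """Compare toxicity of model outputs across demographic groups."""
    scores = {group: [score_toxicity(r) for r in responses]
              for group, responses in responses_by_group.items()}
    report = {
        "mean_toxicity": {g: mean(s) for g, s in scores.items()},
        "std_toxicity": {g: stdev(s) for g, s in scores.items()},
        "pairwise_tests": {},
    }
    # Mann-Whitney U test: checks whether one group's toxicity scores tend to
    # be larger than another's, without assuming a particular distribution.
    for g1, g2 in combinations(scores, 2):
        u_stat, p_value = mannwhitneyu(scores[g1], scores[g2],
                                       alternative="two-sided")
        report["pairwise_tests"][f"{g1} vs {g2}"] = {"U": u_stat, "p": p_value}
    return report
```

A large gap in mean toxicity between groups, together with a small p-value on the pairwise test, would indicate that the model treats the corresponding groups unevenly.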
