Are aligned neural networks adversarially aligned?

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions but refuse to answer requests that could cause harm. However, adversarial users can construct inputs that circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However, the recent trend in large-scale ML models is toward multimodal models that allow users to provide images that influence the generated text. We show that these models can be easily attacked, i.e., induced to perform arbitrary unaligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
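To make the image-perturbation attack concrete, below is a minimal sketch of a standard PGD-style attack on the image input of a multimodal chat model. This is not the paper's exact method: the model wrapper, its argument names (`pixel_values`, `input_ids`, `labels`), and the HuggingFace-style `.loss` output are assumptions, and `prompt_ids` / `target_ids` stand in for a tokenized user prompt and a tokenized target continuation the attacker wants the model to emit.

```python
# Minimal sketch of a PGD-style attack on the image input of a multimodal
# language model. The model interface (pixel_values / input_ids / labels,
# HuggingFace-style .loss) is an assumption, not the paper's code.
import torch


def pgd_image_attack(model, image, prompt_ids, target_ids,
                     eps=8 / 255, step_size=1 / 255, num_steps=500):
    """Search an L-infinity ball of radius `eps` around `image` for a
    perturbation that makes the model emit `target_ids` after `prompt_ids`."""
    original = image.detach()
    adv = original.clone().requires_grad_(True)

    for _ in range(num_steps):
        # Teacher-forced loss of the target continuation (assumed API:
        # passing `labels` returns the cross-entropy over the labeled tokens).
        loss = model(pixel_values=adv, input_ids=prompt_ids, labels=target_ids).loss

        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv -= step_size * grad.sign()                           # descend on the target loss
            adv.copy_((adv - original).clamp(-eps, eps) + original)  # project back to the eps-ball
            adv.clamp_(0.0, 1.0)                                     # keep a valid image

    return adv.detach()
```

In an unconstrained threat model, where the adversarial image need not resemble any benign image, the projection onto the eps-ball can simply be dropped, giving the attacker even more freedom.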
