Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models

A growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise needed to apply machine learning to many new problems. Yet foundation models pose a clear dual-use risk: they indiscriminately reduce the costs of building both harmful and beneficial machine learning systems. Policy tools such as restricted model access and export controls are the primary methods currently used to mitigate such dual-use risks. In this work, we review potential safe-release strategies and argue that both policymakers and AI researchers would benefit from fundamentally new technologies enabling more precise control over the downstream usage of open-source foundation models. We propose one such approach: the task-blocking paradigm, in which foundation models are trained with an additional mechanism that impedes adaptation to harmful tasks without sacrificing performance on desirable tasks. We call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. We present meta-learned adversarial censoring (MLAC), an algorithm for training self-destructing models that leverages techniques from meta-learning and adversarial learning. In a small-scale experiment, we show that MLAC can largely prevent a BERT-style model from being repurposed to perform gender identification without harming the model’s ability to perform profession classification.
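To make the task-blocking idea concrete, below is a minimal PyTorch sketch of one plausible MLAC-style training step, reconstructed from the abstract's description rather than taken from the paper's actual algorithm: a MAML-style inner loop simulates an adversary fine-tuning on the blocked (harmful) task, and the outer objective preserves desired-task performance while making the adapted model's harmful-task loss large. The model, hyperparameters, and the simple negated-loss blocking objective are all illustrative assumptions.

```python
# Hypothetical MLAC-style update (sketch only; not the paper's exact algorithm).
import torch
import torch.nn.functional as F


def functional_forward(params, x):
    """Two-layer MLP classifier applied with explicit parameter tensors."""
    h = F.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


def mlac_style_step(params, desired_batch, harmful_batch,
                    inner_lr=0.1, inner_steps=3, block_weight=1.0):
    """Return a loss whose gradient blocks adaptation to the harmful task."""
    x_d, y_d = desired_batch
    x_h, y_h = harmful_batch

    # Inner loop: simulate an adversary fine-tuning on the harmful task
    # with a few gradient steps, keeping the graph so the outer update
    # can differentiate through the adaptation (as in MAML).
    fast = dict(params)
    for _ in range(inner_steps):
        inner_loss = F.cross_entropy(functional_forward(fast, x_h), y_h)
        grads = torch.autograd.grad(inner_loss, list(fast.values()),
                                    create_graph=True)
        fast = {k: v - inner_lr * g
                for (k, v), g in zip(fast.items(), grads)}

    # Outer objective: keep desired-task performance with the unadapted
    # parameters, while making the *adapted* model's harmful-task loss
    # large (negation is an illustrative choice; the paper may use a
    # different blocking objective).
    desired_loss = F.cross_entropy(functional_forward(params, x_d), y_d)
    harmful_after = F.cross_entropy(functional_forward(fast, x_h), y_h)
    return desired_loss - block_weight * harmful_after


# Usage sketch with random toy data.
torch.manual_seed(0)
dim, hidden, n_cls = 16, 32, 2
params = {
    "w1": (0.1 * torch.randn(dim, hidden)).requires_grad_(),
    "b1": torch.zeros(hidden, requires_grad=True),
    "w2": (0.1 * torch.randn(hidden, n_cls)).requires_grad_(),
    "b2": torch.zeros(n_cls, requires_grad=True),
}
opt = torch.optim.Adam(params.values(), lr=1e-3)
desired = (torch.randn(8, dim), torch.randint(0, n_cls, (8,)))
harmful = (torch.randn(8, dim), torch.randint(0, n_cls, (8,)))
loss = mlac_style_step(params, desired, harmful)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice such a step would be repeated over many batches and simulated adversary configurations; note also that the negated cross-entropy term is unbounded, so a real implementation would likely use a bounded or otherwise regularized blocking objective.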
