Are aligned neural networks adversarially aligned?
Pang Wei Koh | Christopher A. Choquette-Choo | Florian Tramèr | Nicholas Carlini | Ludwig Schmidt | Katherine Lee | Daphne Ippolito | Milad Nasr | Irena Gao | Matthew Jagielski | Anas Awadalla
[1] Andrew M. Dai, et al. PaLM 2 Technical Report, 2023, arXiv.
[2] Hongsheng Li, et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, 2023, arXiv.
[3] Mohamed Elhoseiny, et al. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, 2023, arXiv.
[4] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, arXiv.
[5] A. Dragan, et al. Automatically Auditing Large Language Models via Discrete Optimization, 2023, ICML.
[6] Florian Tramèr, et al. Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators, 2023, arXiv.
[7] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ICML.
[8] Christopher D. Manning, et al. Holistic Evaluation of Language Models, 2023, Annals of the New York Academy of Sciences.
[9] Ledell Yu Wu, et al. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Junyi Jessy Li, et al. News Summarization and Evaluation in the Era of GPT-3, 2022, arXiv.
[11] Richard Ngo. The alignment problem from a deep learning perspective, 2022, arXiv.
[12] Shiri Dori-Hacohen, et al. Current and Near-Term AI as a Potential Existential Risk Factor, 2022, AIES.
[13] Roland S. Zimmermann, et al. Increasing Confidence in Adversarial Robustness Evaluations, 2022, NeurIPS.
[14] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk?, 2022, arXiv.
[15] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, Trans. Mach. Learn. Res.
[16] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[17] Tom B. Brown, et al. Predictability and Surprise in Large Generative Models, 2022, FAccT.
[18] Jenia Jitsev, et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, 2021, arXiv.
[19] Po-Sen Huang, et al. Challenges in Detoxifying Language Models, 2021, EMNLP.
[20] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.
[21] Abubakar Abid, et al. Large language models associate Muslims with violence, 2021, Nature Machine Intelligence.
[22] Douwe Kiela, et al. Gradient-based Adversarial Attacks against Text Transformers, 2021, EMNLP.
[23] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[24] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[25] Florian Tramèr, et al. On Adaptive Attacks to Adversarial Example Defenses, 2020, NeurIPS.
[26] Dan Boneh, et al. AdVersarial: Perceptual Ad Blocking meets Adversarial Machine Learning, 2019, CCS.
[27] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control, 2019.
[28] Sameer Singh, et al. Universal Adversarial Triggers for Attacking and Analyzing NLP, 2019, EMNLP.
[29] Lucy Vasserman, et al. Measuring and Mitigating Unintended Bias in Text Classification, 2018, AIES.
[30] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[31] Mani B. Srivastava, et al. Generating Natural Language Adversarial Examples, 2018, EMNLP.
[32] Claudia Eckert, et al. Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables, 2018, 26th European Signal Processing Conference (EUSIPCO).
[33] Hyrum S. Anderson, et al. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018, arXiv.
[34] Dejing Dou, et al. HotFlip: White-Box Adversarial Examples for Text Classification, 2017, ACL.
[35] J. Zico Kolter, et al. Provable defenses against adversarial examples via the convex outer adversarial polytope, 2017, ICML.
[36] Percy Liang, et al. Adversarial Examples for Evaluating Reading Comprehension Systems, 2017, EMNLP.
[37] Shane Legg, et al. Deep Reinforcement Learning from Human Preferences, 2017, NIPS.
[38] Junfeng Yang, et al. DeepXplore: Automated Whitebox Testing of Deep Learning Systems, 2017, SOSP.
[39] Mykel J. Kochenderfer, et al. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, 2017, CAV.
[40] Joan Bruna, et al. Intriguing properties of neural networks, 2013, ICLR.
[41] Fabio Roli, et al. Evasion Attacks against Machine Learning at Test Time, 2013, ECML/PKDD.
[42] Nick Bostrom, et al. Existential Risk Prevention as Global Priority, 2013.
[43] Vicente Ordonez, et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.
[44] Sahar Abdelnabi, et al. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models, 2023, arXiv.