On the Trustworthiness Landscape of State-of-the-art Generative Models: A Comprehensive Survey

Diffusion models and large language models have emerged as leading-edge generative models and have had a revolutionary impact on various aspects of human life. However, the practical deployment of these models has also exposed inherent risks, highlighting their dual nature and raising concerns about their trustworthiness. Despite the abundance of literature on this subject, a comprehensive survey specifically examining the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, this paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: privacy, security, fairness, and responsibility. On this basis, we construct an extensive map of the trustworthiness landscape of these models, provide practical recommendations, and identify future directions. These efforts are crucial for promoting the trustworthy deployment of these models and ultimately benefiting society as a whole.
