Can AI-Generated Text be Reliably Detected?

The rapid progress of Large Language Models (LLMs) has made them capable of performing astonishingly well on various tasks, including document completion and question answering. The unregulated use of these models, however, can potentially lead to malicious consequences such as plagiarism, fake-news generation, and spamming. Reliable detection of AI-generated text is therefore critical to ensuring the responsible use of LLMs. Recent works attempt to tackle this problem either by using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show both empirically and theoretically that these detectors are not reliable in practical scenarios. Empirically, we show that paraphrasing attacks, in which a light paraphraser is applied on top of the generative text model, can break a whole range of detectors, including those that use watermarking schemes as well as neural network-based detectors and zero-shot classifiers. We then provide a theoretical impossibility result indicating that, for a sufficiently good language model, even the best-possible detector can perform only marginally better than a random classifier. Finally, we show that even LLMs protected by watermarking schemes can be vulnerable to spoofing attacks, in which adversarial humans infer hidden watermarking signatures and add them to their own text so that it is detected as text generated by the LLMs, potentially causing reputational damage to the models' developers. We believe these results can open an honest conversation in the community regarding the ethical and reliable use of AI-generated text.
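To make the paraphrasing attack concrete, the sketch below chains an off-the-shelf paraphraser after a text generator before the output ever reaches a detector. This is a minimal illustration under assumed components: the model names (gpt2 as a stand-in for the target LLM, Vamsi/T5_Paraphrase_Paws as the paraphraser) and the sentence-level paraphrase helper are illustrative choices, not the exact models or pipeline evaluated in the paper.

```python
# Minimal sketch of a paraphrasing attack: a light paraphraser is applied on top
# of the output of a (possibly watermarked or otherwise detectable) text generator.
# The model names below are illustrative assumptions, not the paper's exact setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for the target LLM

paraphraser_name = "Vamsi/T5_Paraphrase_Paws"  # assumed off-the-shelf T5 paraphraser
para_tokenizer = AutoTokenizer.from_pretrained(paraphraser_name)
para_model = AutoModelForSeq2SeqLM.from_pretrained(paraphraser_name)

def paraphrase(text: str) -> str:
    """Rewrite the text sentence by sentence to perturb detector signatures."""
    rewritten = []
    for sentence in text.split(". "):
        inputs = para_tokenizer(
            "paraphrase: " + sentence, return_tensors="pt", truncation=True
        )
        outputs = para_model.generate(**inputs, num_beams=5, max_new_tokens=64)
        rewritten.append(para_tokenizer.decode(outputs[0], skip_special_tokens=True))
    return ". ".join(rewritten)

# Text that a detector would normally be run on ...
ai_text = generator("The impact of large language models on society",
                    max_new_tokens=80)[0]["generated_text"]
# ... and the paraphrased version that is handed to the detector instead.
evasive_text = paraphrase(ai_text)
print(evasive_text)
```

The intent of the attack, as described above, is that the paraphraser preserves the meaning of the generated text while redistributing its surface-level and token-level patterns, which are exactly the signals that watermarking schemes, zero-shot classifiers, and trained neural detectors rely on.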
