Large Language Models Can Be Strong Differentially Private Learners

Differentially private (DP) learning has seen limited success in building large deep learning models of text, and straightforward attempts at applying differentially private stochastic gradient descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget, as well as strong non-private baselines, by directly fine-tuning pretrained models with DP optimization on moderately sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training, at a modest run-time overhead. Contrary to the conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models tends not to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https: .

Figure: fine-tuning GPT-2 with full DP optimization is on par with or outperforms other methods that execute gradient updates in low-dimensional spaces; results are on the E2E dataset.
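To make the DP-SGD recipe concrete, here is a minimal NumPy sketch of one private update step for least-squares linear regression: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, and Gaussian noise calibrated to the clipping norm is added before the update. This follows the standard Abadi et al. recipe; the function name and hyperparameter defaults are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def dp_sgd_step(w, X, y, clip_norm=1.0, noise_mult=1.0, lr=0.1, rng=None):
    """One DP-SGD step for least-squares linear regression (sketch).

    Per-example gradients are clipped to L2 norm <= clip_norm, summed,
    and Gaussian noise with std noise_mult * clip_norm is added before
    averaging and taking a gradient step.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    # Per-example gradients of 0.5 * (x.w - y)^2, shape (n, d).
    residuals = X @ w - y
    per_ex_grads = residuals[:, None] * X
    # Clip each example's gradient: scale down only if its norm exceeds clip_norm.
    norms = np.linalg.norm(per_ex_grads, axis=1, keepdims=True)
    clipped = per_ex_grads / np.maximum(1.0, norms / clip_norm)
    # Sum, add calibrated Gaussian noise, then average over the batch.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_mult * clip_norm, size=w.shape)
    return w - lr * noisy_sum / n
```

The key cost in deep learning is that `per_ex_grads` must normally be materialized per example, which for a Transformer multiplies the gradient memory by the batch size; that is the overhead the paper's memory-saving technique targets.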
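The abstract's claim that clipping can run "without instantiating per-example gradients for any linear layer" rests on a factorization: for a linear layer, example i's weight gradient is the outer product of its output gradient b_i and its input a_i, so its Frobenius norm is simply ||a_i|| * ||b_i||. A minimal sketch of this idea for inputs without a sequence dimension follows (the paper's actual technique also handles sequential inputs; function names here are illustrative):

```python
import numpy as np

def ghost_clip_coeffs(a, b, clip_norm=1.0):
    """Per-example clipping coefficients for a linear layer, computed
    from activations `a` (n, d_in) and output gradients `b` (n, d_out)
    without ever forming the (n, d_out, d_in) per-example gradients.
    The norm of the outer product b_i a_i^T factors as ||a_i|| * ||b_i||."""
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

def clipped_grad_sum(a, b, clip_norm=1.0):
    """Sum of clipped per-example gradients via a single matmul:
    sum_i c_i * b_i a_i^T == (c * b)^T @ a, shape (d_out, d_in)."""
    c = ghost_clip_coeffs(a, b, clip_norm)
    return (c[:, None] * b).T @ a
```

Memory never exceeds that of an ordinary backward pass: only the per-example norms (one scalar each) are stored, and the clipped sum is assembled by rescaling the output gradients before the usual weight-gradient matmul.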

[1]  Anna Rumshisky,et al.  An Efficient DP-SGD Mechanism for Large Scale NLU Models , 2021, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Yoav Goldberg,et al.  BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models , 2021, ACL.

[3]  Edward A. Fox,et al.  Differentially Private Synthetic Medical Data Generation using Convolutional GANs , 2020, Inf. Sci..

[4]  Aaron Roth,et al.  Gaussian differential privacy , 2019, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[5]  Huseyin A. Inan,et al.  Differentially Private Fine-tuning of Language Models , 2021, ICLR.

[6]  Nicolas Papernot,et al.  Hyperparameter Tuning with Renyi Differential Privacy , 2021, ICLR.

[7]  Graham Cormode,et al.  Opacus: User-Friendly Differential Privacy Library in PyTorch , 2021, ArXiv.

[8]  Quoc V. Le,et al.  Finetuned Language Models Are Zero-Shot Learners , 2021, ICLR.

[9]  Michael S. Bernstein,et al.  On the Opportunities and Risks of Foundation Models , 2021, ArXiv.

[10]  Badih Ghazi,et al.  Large-Scale Differentially Private BERT , 2021, EMNLP.

[11]  John Duchi,et al.  Private Adaptive Gradient Methods for Convex Optimization , 2021, ICML.

[12]  Yelong Shen,et al.  LoRA: Low-Rank Adaptation of Large Language Models , 2021, ICLR.

[13]  Huishuai Zhang,et al.  Large Scale Private Learning via Low-rank Reparametrization , 2021, ICML.

[14]  Sivakanth Gopi,et al.  Numerical Composition of Differential Privacy , 2021, NeurIPS.

[15]  Abhinav Aggarwal,et al.  On a Utilitarian Approach to Privacy Preserving Text Generation , 2021, PRIVATENLP.

[16]  Brian Lester,et al.  The Power of Scale for Parameter-Efficient Prompt Tuning , 2021, EMNLP.

[17]  Huseyin A. Inan,et al.  Privacy Regularization: Joint Privacy-Utility Optimization in LanguageModels , 2021, NAACL.

[18]  Ilya Mironov,et al.  Wide Network Learning with Differential Privacy , 2021, ArXiv.

[19]  Wei Chen,et al.  Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning , 2021, ICLR.

[20]  Janardhan Kulkarni,et al.  Fast and Memory Efficient Differentially Private-SGD via JL Projections , 2021, NeurIPS.

[21]  Diyi Yang,et al.  The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics , 2021, GEM.

[22]  Danqi Chen,et al.  Making Pre-trained Language Models Better Few-shot Learners , 2021, ACL.

[23]  Colin Raffel,et al.  Extracting Training Data from Large Language Models , 2020, USENIX Security Symposium.

[24]  Dan Boneh,et al.  Differentially Private Learning Needs Better Features (or Much More Data) , 2020, ICLR.

[25]  Iryna Gurevych,et al.  AdapterDrop: On the Efficiency of Adapters in Transformers , 2020, EMNLP.

[26]  Gautam Kamath,et al.  Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization , 2020, NeurIPS.

[27]  Daniel Kifer,et al.  Scaling up Differentially Private Deep Learning with Fast Per-Example Gradient Clipping , 2020, Proc. Priv. Enhancing Technol..

[28]  Zhiwei Steven Wu,et al.  Private Post-GAN Boosting , 2020, ICLR.

[29]  Zhiwei Steven Wu,et al.  Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification , 2020, ICLR.

[30]  Tao Yu,et al.  DART: Open-Domain Structured Data Record to Text Generation , 2020, NAACL.

[31]  Iryna Gurevych,et al.  AdapterFusion: Non-Destructive Task Composition for Transfer Learning , 2020, EACL.

[32]  Joe Davison,et al.  Compacter: Efficient Low-Rank Hypercomplex Adapter Layers , 2021, NeurIPS.

[33]  Yossi Matias,et al.  Learning and Evaluating a Differentially Private Pre-trained Language Model , 2021, PRIVATENLP.

[34]  Percy Liang,et al.  Prefix-Tuning: Optimizing Continuous Prompts for Generation , 2021, ACL.

[35]  Kai Li,et al.  TextHide: Tackling Data Privacy for Language Understanding Tasks , 2020, FINDINGS.

[36]  Kai Li,et al.  InstaHide: Instance-hiding Schemes for Private Distributed Learning , 2020, ICML.

[37]  H. Brendan McMahan,et al.  Training Production Language Models without Memorizing User Data , 2020, ArXiv.

[38]  Dylan Slack,et al.  Differentially Private Language Models Benefit from Public Pre-training , 2020, PRIVATENLP.

[39]  Song Han,et al.  TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning , 2020, NeurIPS.

[40]  Tribhuvanesh Orekondy,et al.  GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators , 2020, NeurIPS.

[41]  Omer Levy,et al.  Emergent linguistic structure in artificial neural networks trained by self-supervision , 2020, Proceedings of the National Academy of Sciences.

[42]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[43]  Alec Radford,et al.  Scaling Laws for Neural Language Models , 2020, ArXiv.

[44]  Weijie J. Su,et al.  Deep Learning with Gaussian Differential Privacy , 2019, Harvard data science review.

[45]  Jianfeng Gao,et al.  DIALOGPT : Large-Scale Generative Pre-training for Conversational Response Generation , 2019, ACL.

[46]  A. Honkela,et al.  Computing Tight Differential Privacy Guarantees Using FFT , 2019, AISTATS.

[47]  Yejin Choi,et al.  The Curious Case of Neural Text Degeneration , 2019, ICLR.

[48]  Shuang Song,et al.  Making the Shoe Fit: Architectures, Initializations, and Tuning for Learning with Privacy , 2019 .

[49]  Sebastian Riedel,et al.  Language Models as Knowledge Bases? , 2019, EMNLP.

[50]  Li Zhang,et al.  Rényi Differential Privacy of the Sampled Gaussian Mechanism , 2019, ArXiv.

[51]  Sashank J. Reddi,et al.  AdaCliP: Adaptive Clipping for Private SGD , 2019, ArXiv.

[52]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[53]  Reihaneh Torkzadehmahani,et al.  DP-CGAN: Differentially Private Synthetic Data and Label Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[54]  Oren Melamud,et al.  Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models , 2019, Proceedings of the 2nd Clinical Natural Language Processing Workshop.

[55]  Mona Attariyan,et al.  Parameter-Efficient Transfer Learning for NLP , 2019, ICML.

[56]  Joelle Pineau,et al.  The Second Conversational Intelligence Challenge (ConvAI2) , 2019, The NeurIPS '18 Competition.

[57]  Kunal Talwar,et al.  Private selection from private candidates , 2018, STOC.

[58]  Úlfar Erlingsson,et al.  The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks , 2018, USENIX Security Symposium.

[59]  Emiliano De Cristofaro,et al.  LOGAN: Membership Inference Attacks Against Generative Models , 2017, Proc. Priv. Enhancing Technol..

[60]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[61]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[62]  Shashi Narayan,et al.  Privacy-preserving Neural Representations of Text , 2018, EMNLP.

[63]  Alexander M. Rush,et al.  Learning Neural Templates for Text Generation , 2018, EMNLP.

[64]  Jianfeng Gao,et al.  Neural Approaches to Conversational AI , 2018, ACL.

[65]  Úlfar Erlingsson,et al.  Scalable Private Learning with PATE , 2018, ICLR.

[66]  Jason Weston,et al.  Personalizing Dialogue Agents: I have a dog, do you have pets too? , 2018, ACL.

[67]  H. Brendan McMahan,et al.  Learning Differentially Private Recurrent Language Models , 2017, ICLR.

[68]  Hao Wu,et al.  Mixed Precision Training , 2017, ICLR.

[69]  Emiliano De Cristofaro,et al.  : Membership Inference Attacks Against Generative Models , 2018 .

[70]  Samuel R. Bowman,et al.  The Multi-Genre NLI Corpus , 2018 .

[71]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[72]  Verena Rieser,et al.  The E2E Dataset: New Challenges For End-to-End Generation , 2017, SIGDIAL Conference.

[73]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[74]  Ilya Mironov,et al.  Rényi Differential Privacy , 2017, 2017 IEEE 30th Computer Security Foundations Symposium (CSF).

[75]  Samy Bengio,et al.  Understanding deep learning requires rethinking generalization , 2016, ICLR.

[76]  Martín Abadi,et al.  Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data , 2016, ICLR.

[77]  Vitaly Shmatikov,et al.  Membership Inference Attacks Against Machine Learning Models , 2016, 2017 IEEE Symposium on Security and Privacy (SP).

[78]  Ian Goodfellow,et al.  Deep Learning with Differential Privacy , 2016, CCS.

[79]  Jianfeng Gao,et al.  A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.

[80]  Ian J. Goodfellow,et al.  Efficient Per-Example Gradient Computations , 2015, ArXiv.

[81]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[82]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[83]  Raef Bassily,et al.  Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds , 2014, 1405.7085.

[84]  Kamalika Chaudhuri,et al.  A Stability-based Validation Procedure for Differentially Private Machine Learning , 2013, NIPS.

[85]  Anand D. Sarwate,et al.  Stochastic gradient descent with differentially private updates , 2013, 2013 IEEE Global Conference on Signal and Information Processing.

[86]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.