Large-Scale Differentially Private BERT

In this work, we study the large-scale pretraining of BERT-Large [DCLT19] with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions of examples (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance training efficiency by using an increasing batch-size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized by effective use of JAX [BFH18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5.36. To put this number in perspective, non-private BERT models achieve an accuracy of ∼70%.
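To make the computation concrete, below is a minimal, illustrative JAX sketch of a single DP-SGD step in the spirit of [SVK20]: per-example gradients are obtained with jax.vmap over jax.grad, clipped to a fixed L2 norm, summed, noised with Gaussian noise calibrated to the clipping norm, and averaged, all inside a jax.jit-compiled (XLA) function. The toy linear model, loss, and hyperparameters (L2_CLIP, NOISE_MULTIPLIER, LEARNING_RATE) are placeholders chosen for illustration only; this is not the implementation or the settings used in this work.

```python
import jax
import jax.numpy as jnp

# Hypothetical hyperparameters, for illustration only.
L2_CLIP = 1.0           # per-example gradient clipping norm
NOISE_MULTIPLIER = 0.5  # Gaussian noise scale relative to the clipping norm
LEARNING_RATE = 0.1

def loss_fn(params, x, y):
    # Toy linear-regression loss on a single example; a stand-in for BERT's
    # masked-language-model loss.
    pred = jnp.dot(x, params["w"]) + params["b"]
    return 0.5 * (pred - y) ** 2

@jax.jit
def dp_sgd_step(params, xs, ys, key):
    # 1) Per-example gradients: vmap the single-example gradient over the batch.
    per_ex_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)

    # 2) Clip each example's gradient to global L2 norm <= L2_CLIP.
    sq_norms = sum(jnp.sum(jnp.reshape(g, (g.shape[0], -1)) ** 2, axis=1)
                   for g in jax.tree_util.tree_leaves(per_ex_grads))
    scale = jnp.minimum(1.0, L2_CLIP / (jnp.sqrt(sq_norms) + 1e-12))
    clipped = jax.tree_util.tree_map(
        lambda g: g * scale.reshape((-1,) + (1,) * (g.ndim - 1)), per_ex_grads)

    # 3) Sum over the batch, add Gaussian noise calibrated to the clip norm, average.
    batch_size = xs.shape[0]
    summed_leaves, treedef = jax.tree_util.tree_flatten(
        jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped))
    noise_keys = jax.random.split(key, len(summed_leaves))
    noisy_mean = jax.tree_util.tree_unflatten(treedef, [
        (g + NOISE_MULTIPLIER * L2_CLIP * jax.random.normal(k, g.shape)) / batch_size
        for g, k in zip(summed_leaves, noise_keys)])

    # 4) Plain SGD update on the privatized average gradient.
    return jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, noisy_mean)

# Example usage with random data.
key = jax.random.PRNGKey(0)
key, xkey, ykey = jax.random.split(key, 3)
xs = jax.random.normal(xkey, (32, 16))
ys = jax.random.normal(ykey, (32,))
params = {"w": jnp.zeros(16), "b": jnp.zeros(())}
params = dp_sgd_step(params, xs, ys, key)
```

The increasing batch-size schedule described above would correspond to feeding progressively larger xs/ys batches to the same compiled step; with privacy accounting such as Rényi DP [31], larger (mega-)batches improve the signal-to-noise ratio of the noised average gradient, which is the effect the abstract reports.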

References

[1] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[2] George E. Dahl, et al. Faster Neural Network Training with Data Echoing, 2019, ArXiv.

[3] Joshua Ainslie, et al. FNet: Mixing Tokens with Fourier Transforms, 2021, NAACL.

[4] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[5] Manfred K. Warmuth, et al. LocoProp: Enhancing BackProp via Local Loss Optimization, 2021, ArXiv.

[6] Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[7] Minhyung Cho, et al. Riemannian approach to batch normalization, 2017, NIPS.

[8] Elad Hoffer, et al. Norm matters: efficient and accurate normalization schemes in deep networks, 2018, NeurIPS.

[9] Noam Shazeer, et al. GSPMD: General and Scalable Parallelization for ML Computation Graphs, 2021, ArXiv.

[10] Yossi Matias, et al. Learning and Evaluating a Differentially Private Pre-trained Language Model, 2021, PRIVATENLP.

[11] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[12] Alexander Kolesnikov, et al. MLP-Mixer: An all-MLP Architecture for Vision, 2021, ArXiv.

[13] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.

[14] Li Zhang, et al. Learning Differentially Private Language Models Without Losing Accuracy, 2017, ArXiv.

[15] Manfred K. Warmuth, et al. Robust Bi-Tempered Logistic Loss Based on Bregman Divergences, 2019, NeurIPS.

[16] George E. Dahl, et al. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes, 2021, ArXiv.

[17] Seong Joon Oh, et al. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights, 2021, ICLR.

[18] Moni Naor, et al. Our Data, Ourselves: Privacy Via Distributed Noise Generation, 2006, EUROCRYPT.

[19] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, J. Mach. Learn. Res.

[20] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[21] H. Brendan McMahan, et al. Differentially Private Learning with Adaptive Clipping, 2019, NeurIPS.

[22] Úlfar Erlingsson, et al. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks, 2018, USENIX Security Symposium.

[23] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.

[24] Ian Goodfellow, et al. Deep Learning with Differential Privacy, 2016, CCS.

[25] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.

[26] Manfred K. Warmuth, et al. Step-size Adaptation Using Exponentiated Gradient Updates, 2022, ArXiv.

[27] Geoffrey E. Hinton, et al. Large scale distributed neural network training through online distillation, 2018, ICLR.

[28] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[29] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[30] Gautam Kamath, et al. Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization, 2020, NeurIPS.

[31] Ilya Mironov, et al. Rényi Differential Privacy, 2017, IEEE 30th Computer Security Foundations Symposium (CSF).

[32] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.

[33] Aaron Roth, et al. The Algorithmic Foundations of Differential Privacy, 2014, Found. Trends Theor. Comput. Sci.

[34] Vitaly Feldman, et al. Does learning require memorization? A short tale about a long tail, 2019, STOC.

[35] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).

[36] Yoram Singer, et al. Memory Efficient Adaptive Optimization, 2019, NeurIPS.

[37] Dietrich Klakow, et al. Robust Differentially Private Training of Deep Neural Networks, 2020, ArXiv.

[38] Antti Honkela, et al. Learning Rate Adaptation for Differentially Private Learning, 2020, AISTATS.

[39] Naman Agarwal, et al. Stochastic Optimization with Laggard Data Pipelines, 2020, NeurIPS.

[40] Vitaly Feldman, et al. When is memorization of irrelevant training data necessary for high-accuracy learning?, 2020, STOC.

[41] Sashank J. Reddi, et al. AdaCliP: Adaptive Clipping for Private SGD, 2019, ArXiv.

[42] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[43] Matthew Johnson, et al. Compiling machine learning programs via high-level tracing, 2018.

[44] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.

[45] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.

[46] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.