Large-Scale Differentially Private BERT

In this work, we study the large-scale pretraining of BERT-Large [DCLT19] with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions of examples (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance training efficiency by using an increasing batch-size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized by effective use of JAX [BFH18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5.36. To put this number in perspective, non-private BERT models achieve an accuracy of ∼70%.
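To make the computation concrete, below is a minimal, illustrative JAX sketch of a single DP-SGD step in the spirit of [SVK20]: per-example gradients are obtained with jax.vmap over jax.grad, clipped to a fixed L2 norm, summed, noised with Gaussian noise calibrated to the clipping norm, and averaged, all inside a jax.jit-compiled (XLA) function. The toy linear model, loss, and hyperparameters (L2_CLIP, NOISE_MULTIPLIER, LEARNING_RATE) are placeholders chosen for illustration only; this is not the implementation or the settings used in this work.

```python
import jax
import jax.numpy as jnp

# Hypothetical hyperparameters, for illustration only.
L2_CLIP = 1.0           # per-example gradient clipping norm
NOISE_MULTIPLIER = 0.5  # Gaussian noise scale relative to the clipping norm
LEARNING_RATE = 0.1

def loss_fn(params, x, y):
    # Toy linear-regression loss on a single example; a stand-in for BERT's
    # masked-language-model loss.
    pred = jnp.dot(x, params["w"]) + params["b"]
    return 0.5 * (pred - y) ** 2

@jax.jit
def dp_sgd_step(params, xs, ys, key):
    # 1) Per-example gradients: vmap the single-example gradient over the batch.
    per_ex_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)

    # 2) Clip each example's gradient to global L2 norm <= L2_CLIP.
    sq_norms = sum(jnp.sum(jnp.reshape(g, (g.shape[0], -1)) ** 2, axis=1)
                   for g in jax.tree_util.tree_leaves(per_ex_grads))
    scale = jnp.minimum(1.0, L2_CLIP / (jnp.sqrt(sq_norms) + 1e-12))
    clipped = jax.tree_util.tree_map(
        lambda g: g * scale.reshape((-1,) + (1,) * (g.ndim - 1)), per_ex_grads)

    # 3) Sum over the batch, add Gaussian noise calibrated to the clip norm, average.
    batch_size = xs.shape[0]
    summed_leaves, treedef = jax.tree_util.tree_flatten(
        jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped))
    noise_keys = jax.random.split(key, len(summed_leaves))
    noisy_mean = jax.tree_util.tree_unflatten(treedef, [
        (g + NOISE_MULTIPLIER * L2_CLIP * jax.random.normal(k, g.shape)) / batch_size
        for g, k in zip(summed_leaves, noise_keys)])

    # 4) Plain SGD update on the privatized average gradient.
    return jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, noisy_mean)

# Example usage with random data.
key = jax.random.PRNGKey(0)
key, xkey, ykey = jax.random.split(key, 3)
xs = jax.random.normal(xkey, (32, 16))
ys = jax.random.normal(ykey, (32,))
params = {"w": jnp.zeros(16), "b": jnp.zeros(())}
params = dp_sgd_step(params, xs, ys, key)
```

The increasing batch-size schedule described above would correspond to feeding progressively larger xs/ys batches to the same compiled step; with privacy accounting such as Rényi DP [31], larger (mega-)batches improve the signal-to-noise ratio of the noised average gradient, which is the effect the abstract reports.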

References

[1] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[2] George E. Dahl, et al. Faster Neural Network Training with Data Echoing, 2019, ArXiv.

[3] Joshua Ainslie, et al. FNet: Mixing Tokens with Fourier Transforms, 2021, NAACL.

[4] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[5] Manfred K. Warmuth, et al. LocoProp: Enhancing BackProp via Local Loss Optimization, 2021, ArXiv.

[6] Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[7] Minhyung Cho, et al. Riemannian approach to batch normalization, 2017, NIPS.

[8] Elad Hoffer, et al. Norm matters: efficient and accurate normalization schemes in deep networks, 2018, NeurIPS.

[9] Noam Shazeer, et al. GSPMD: General and Scalable Parallelization for ML Computation Graphs, 2021, ArXiv.

[10] Yossi Matias, et al. Learning and Evaluating a Differentially Private Pre-trained Language Model, 2021, PRIVATENLP.

[11] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[12] Alexander Kolesnikov, et al. MLP-Mixer: An all-MLP Architecture for Vision, 2021, ArXiv.

[13] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.

[14] Li Zhang, et al. Learning Differentially Private Language Models Without Losing Accuracy, 2017, ArXiv.

[15] Manfred K. Warmuth, et al. Robust Bi-Tempered Logistic Loss Based on Bregman Divergences, 2019, NeurIPS.

[16] George E. Dahl, et al. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes, 2021, ArXiv.

[17] Seong Joon Oh, et al. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights, 2021, ICLR.

[18] Moni Naor, et al. Our Data, Ourselves: Privacy Via Distributed Noise Generation, 2006, EUROCRYPT.

[19] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, J. Mach. Learn. Res.

[20] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[21] H. Brendan McMahan, et al. Differentially Private Learning with Adaptive Clipping, 2019, NeurIPS.

[22] Úlfar Erlingsson, et al. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks, 2018, USENIX Security Symposium.

[23] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.

[24] Ian Goodfellow, et al. Deep Learning with Differential Privacy, 2016, CCS.

[25] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.

[26] Manfred K. Warmuth, et al. Step-size Adaptation Using Exponentiated Gradient Updates, 2022, ArXiv.

[27] Geoffrey E. Hinton, et al. Large scale distributed neural network training through online distillation, 2018, ICLR.

[28] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[29] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[30] Gautam Kamath, et al. Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization, 2020, NeurIPS.

[31] Ilya Mironov, et al. Rényi Differential Privacy, 2017, IEEE 30th Computer Security Foundations Symposium (CSF).

[32] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.

[33] Aaron Roth, et al. The Algorithmic Foundations of Differential Privacy, 2014, Found. Trends Theor. Comput. Sci.

[34] Vitaly Feldman, et al. Does learning require memorization? A short tale about a long tail, 2019, STOC.

[35] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).

[36] Yoram Singer, et al. Memory Efficient Adaptive Optimization, 2019, NeurIPS.

[37] Dietrich Klakow, et al. Robust Differentially Private Training of Deep Neural Networks, 2020, ArXiv.

[38] Antti Honkela, et al. Learning Rate Adaptation for Differentially Private Learning, 2020, AISTATS.

[39] Naman Agarwal, et al. Stochastic Optimization with Laggard Data Pipelines, 2020, NeurIPS.

[40] Vitaly Feldman, et al. When is memorization of irrelevant training data necessary for high-accuracy learning?, 2020, STOC.

[41] Sashank J. Reddi, et al. AdaCliP: Adaptive Clipping for Private SGD, 2019, ArXiv.

[42] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[43] Matthew Johnson, et al. Compiling machine learning programs via high-level tracing, 2018.

[44] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.

[45] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.

[46] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.