Robust Training in High Dimensions via Block Coordinate Geometric Median Descent

Geometric median (Gm) is a classical statistical method for robustly estimating the location of uncorrupted data; under gross corruption, it attains the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) in high-dimensional optimization problems. In this paper, we show that by applying Gm to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems while obtaining non-asymptotic convergence rates comparable to those of SGD with full Gm.
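
To make the idea concrete, the sketch below is a minimal, illustrative implementation (not the authors' exact algorithm) of block coordinate geometric-median aggregation with an error-feedback memory: the geometric median is approximated with a few Weiszfeld iterations, the block is chosen greedily by the magnitude of the memory-corrected mean gradient, and coordinates left out of the block are carried over in the memory buffer. The block-selection rule, the memory update, and all function names are assumptions made for illustration.

```python
# Illustrative sketch (assumed details, not the paper's exact method) of
# block coordinate geometric-median aggregation with error-feedback memory.
import numpy as np

def weiszfeld(points, n_iter=50, eps=1e-8):
    """Approximate the geometric median of the rows of `points` (Weiszfeld's algorithm)."""
    z = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(points - z, axis=1)
        d = np.maximum(d, eps)                 # avoid division by zero at a data point
        w = 1.0 / d
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

def block_gm_aggregate(grads, memory, block_size):
    """Aggregate worker gradients with Gm restricted to a block of coordinates.

    grads:   (num_workers, dim) stochastic gradients (possibly corrupted).
    memory:  (dim,) error-feedback buffer carrying coordinates skipped so far.
    Returns the aggregated update and the new memory.
    """
    corrected = grads.mean(axis=0) + memory               # memory-corrected mean (illustrative rule)
    block = np.argsort(-np.abs(corrected))[:block_size]   # largest-magnitude coordinates

    update = np.zeros_like(memory)
    update[block] = weiszfeld(grads[:, block] + memory[block])  # robust Gm on the block only

    new_memory = corrected - update                        # coordinates not sent stay in memory
    return update, new_memory

# Tiny usage example: 10 workers, 3 of them send grossly corrupted gradients.
rng = np.random.default_rng(0)
dim, workers, corrupt = 1000, 10, 3
grads = rng.normal(1.0, 0.1, size=(workers, dim))
grads[:corrupt] = 100.0                                    # gross corruption on a minority of workers
memory = np.zeros(dim)
update, memory = block_gm_aggregate(grads, memory, block_size=50)
print(update[np.abs(update) > 0][:5])                      # block entries stay near 1.0, not 100
```

Restricting the Weiszfeld iterations to the selected block is what brings the per-step aggregation cost down from roughly O(n·d) to O(n·k) per iteration (n workers, d dimensions, block size k), which is the computational motivation for the block coordinate view sketched here.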
