Robust Training in High Dimensions via Block Coordinate Geometric Median Descent

Geometric median (Gm) is a classical statistical method for robustly estimating the location of uncorrupted data; under gross corruption, it attains the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) in high-dimensional optimization problems. In this paper, we show that by applying Gm to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems while obtaining non-asymptotic convergence rates comparable to those of SGD with full Gm.
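
To make the idea concrete, the sketch below is a minimal, illustrative implementation (not the authors' exact algorithm) of block coordinate geometric-median aggregation with an error-feedback memory: the geometric median is approximated with a few Weiszfeld iterations, the block is chosen greedily by the magnitude of the memory-corrected mean gradient, and coordinates left out of the block are carried over in the memory buffer. The block-selection rule, the memory update, and all function names are assumptions made for illustration.

```python
# Illustrative sketch (assumed details, not the paper's exact method) of
# block coordinate geometric-median aggregation with error-feedback memory.
import numpy as np

def weiszfeld(points, n_iter=50, eps=1e-8):
    """Approximate the geometric median of the rows of `points` (Weiszfeld's algorithm)."""
    z = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(points - z, axis=1)
        d = np.maximum(d, eps)                 # avoid division by zero at a data point
        w = 1.0 / d
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

def block_gm_aggregate(grads, memory, block_size):
    """Aggregate worker gradients with Gm restricted to a block of coordinates.

    grads:   (num_workers, dim) stochastic gradients (possibly corrupted).
    memory:  (dim,) error-feedback buffer carrying coordinates skipped so far.
    Returns the aggregated update and the new memory.
    """
    corrected = grads.mean(axis=0) + memory               # memory-corrected mean (illustrative rule)
    block = np.argsort(-np.abs(corrected))[:block_size]   # largest-magnitude coordinates

    update = np.zeros_like(memory)
    update[block] = weiszfeld(grads[:, block] + memory[block])  # robust Gm on the block only

    new_memory = corrected - update                        # coordinates not sent stay in memory
    return update, new_memory

# Tiny usage example: 10 workers, 3 of them send grossly corrupted gradients.
rng = np.random.default_rng(0)
dim, workers, corrupt = 1000, 10, 3
grads = rng.normal(1.0, 0.1, size=(workers, dim))
grads[:corrupt] = 100.0                                    # gross corruption on a minority of workers
memory = np.zeros(dim)
update, memory = block_gm_aggregate(grads, memory, block_size=50)
print(update[np.abs(update) > 0][:5])                      # block entries stay near 1.0, not 100
```

Restricting the Weiszfeld iterations to the selected block is what brings the per-step aggregation cost down from roughly O(n·d) to O(n·k) per iteration (n workers, d dimensions, block size k), which is the computational motivation for the block coordinate view sketched here.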
