Zeroth-Order Regularized Optimization (ZORO): Approximately Sparse Gradients and Adaptive Sampling

We consider the problem of minimizing a high-dimensional objective function, which may include a regularization term, using (possibly noisy) evaluations of the function. Such optimization is also called derivative-free, zeroth-order, or black-box optimization. We propose a new $\textbf{Z}$eroth-$\textbf{O}$rder $\textbf{R}$egularized $\textbf{O}$ptimization method, dubbed ZORO. When the underlying gradient is approximately sparse at an iterate, ZORO needs very few objective function evaluations to obtain a new iterate that decreases the objective. We achieve this with an adaptive, randomized gradient estimator, followed by an inexact proximal-gradient scheme. Under a novel approximately sparse gradient assumption and several convex settings, we show that the (theoretical and empirical) convergence rate of ZORO depends only logarithmically on the problem dimension. Numerical experiments show that ZORO outperforms existing methods that make similar assumptions, on both synthetic and real datasets.
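The abstract describes ZORO's two building blocks at a high level: a gradient estimator that recovers an approximately sparse gradient from a small number of finite-difference queries, and an inexact proximal-gradient update. The sketch below is illustrative only, not the authors' implementation: it assumes Rademacher sampling directions, a simple iterative-hard-thresholding recovery standing in for whatever sparse-recovery solver the paper actually uses, and an $\ell_1$ regularizer handled by soft thresholding.

```python
# A minimal, hypothetical sketch of one ZORO-style step (assumptions noted above).
import numpy as np

def zoro_step(f, x, prox, m, s, delta=1e-4, step=0.5, iht_iters=20, rng=None):
    """One step: sparse gradient estimate from m queries + proximal-gradient update."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Rademacher sampling matrix, scaled so E[Z^T Z] = I (compressed-sensing style).
    Z = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)
    # Finite-difference measurements y_i ~ <z_i, grad f(x)>.
    y = np.array([(f(x + delta * z) - f(x)) / delta for z in Z])
    # Iterative hard thresholding: recover an s-sparse g with Z g ~ y.
    g = np.zeros(d)
    for _ in range(iht_iters):
        g = g + Z.T @ (y - Z @ g)
        keep = np.argpartition(np.abs(g), -s)[-s:]   # keep the s largest entries
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        g[~mask] = 0.0
    # (Inexact) proximal-gradient update using the estimated gradient.
    return prox(x - step * g, step)

def prox_l1(v, t, lam=0.1):
    """Soft thresholding: proximal operator of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

if __name__ == "__main__":
    # Toy problem: the gradient is s-sparse along the trajectory, so m << d queries suffice.
    d, s = 1000, 10
    target = np.zeros(d)
    target[:s] = 1.0
    f = lambda x: 0.5 * np.sum((x - target) ** 2)
    x = np.zeros(d)
    for _ in range(50):
        x = zoro_step(f, x, prox_l1, m=10 * s, s=s)
    print("regularized objective:", f(x) + 0.1 * np.abs(x).sum())
```

In this toy run the per-iteration query cost is m = 10s = 100 function evaluations rather than O(d) = 1000, which is the kind of dimension-insensitive sampling the abstract refers to.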
