Optimal Gradient-based Algorithms for Non-concave Bandit Optimization

Bandit problems with linear or concave rewards have been extensively studied, but relatively little work has addressed bandits with non-concave rewards. This work considers a large family of bandit problems whose unknown underlying reward function is non-concave, including low-rank generalized linear bandits and bandits whose reward is a two-layer neural network with polynomial activations. For the low-rank generalized linear bandit problem, we provide an algorithm whose sample complexity is minimax-optimal in the dimension, refuting conjectures in both [LMT21] and [JWWN19]. Our algorithms are based on a unified zeroth-order optimization paradigm that applies in great generality and attains rates that are optimal in the dimension in several structured polynomial settings. We further demonstrate the applicability of our algorithms to reinforcement learning in the generative-model setting, obtaining improved sample complexity over prior approaches. Finally, we show that standard optimistic algorithms (e.g., UCB) are sub-optimal by dimension factors. In the neural-network setting (with polynomial activations) with noiseless rewards, we provide a bandit algorithm whose sample complexity equals the intrinsic algebraic dimension. Again, we show that optimistic approaches have worse sample complexity, polynomial in the extrinsic dimension (which can be exponentially worse in the polynomial degree).
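The zeroth-order paradigm underlying these algorithms estimates gradients from reward queries alone. As a generic illustration of this primitive (a standard two-point Gaussian-smoothing estimator, not the paper's specific algorithm), one can sketch:

```python
import numpy as np

def zeroth_order_gradient(f, x, mu=1e-3, num_samples=5000, rng=None):
    """Estimate grad f(x) from function evaluations only.

    Uses the two-point Gaussian-smoothing estimator
        g = E_u[ (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u ],  u ~ N(0, I),
    a standard zeroth-order primitive in bandit optimization.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_samples

# Example with a non-concave (quartic) reward: f(x) = -(||x||^2)^2.
# Its true gradient at x0 = (1, 0) is -4*||x0||^2 * x0 = (-4, 0).
f = lambda x: -np.sum(x**2) ** 2
x0 = np.array([1.0, 0.0])
g_est = zeroth_order_gradient(f, x0, rng=np.random.default_rng(0))
```

The estimator is unbiased up to an O(mu^2) smoothing term, with variance decaying as 1/num_samples; how many such queries are needed as a function of the dimension is exactly the quantity the paper's bounds control.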

[1] Colin Wei et al. Shape Matters: Understanding the Implicit Bias of the Noise Covariance, 2020, COLT.

[2] Babak Hassibi et al. Stochastic Linear Bandits with Hidden Low Rank Structure, 2019, ArXiv.

[3] Tengyu Ma et al. Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature, 2021, ArXiv.

[4] Zheng Wen et al. Stochastic Rank-1 Bandits, 2016, AISTATS.

[5] Adam Tauman Kalai et al. Online convex optimization in the bandit setting: gradient descent without a gradient, 2004, SODA '05.

[6] Yuanzhi Li et al. First Efficient Convergence for Streaming k-PCA: A Global, Gap-Free, and Near-Optimal Rate, 2017, FOCS.

[7] Tengyu Ma et al. Beyond Lazy Training for Over-parameterized Tensor Decomposition, 2020, NeurIPS.

[8] Nathan Srebro et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[9] Tengyu Ma et al. Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[10] Tor Lattimore et al. Improved Regret for Zeroth-Order Adversarial Bandit Convex Optimisation, 2020, ArXiv.

[11] Peter L. Bartlett et al. Linear Programming for Large-Scale Markov Decision Problems, 2014, ICML.

[12] Robert D. Nowak et al. Bilinear Bandits with Low-rank Structure, 2019, ICML.

[13] Wei Chu et al. A contextual-bandit approach to personalized news article recommendation, 2010, WWW '10.

[14] Jason D. Lee et al. Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks, 2019, ICLR.

[15] Sébastien Bubeck et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012, Found. Trends Mach. Learn.

[16] Ruosong Wang et al. Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension, 2020, NeurIPS.

[17] John N. Tsitsiklis et al. Linearly Parameterized Bandits, 2008, Math. Oper. Res.

[18] Nathan Srebro et al. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy, 2020, NeurIPS.

[19] Xiaoyu Chen et al. Near-optimal Representation Learning for Linear Bandits and Linear RL, 2021, ICML.

[20] Tor Lattimore et al. High-Dimensional Sparse Linear Bandits, 2020, NeurIPS.

[21] Joel A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices, 2010, Found. Comput. Math.

[22] Jason D. Lee et al. When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?, 2019, ArXiv.

[23] Arthur Jacot et al. Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[24] Tengyu Ma et al. Online Learning of Eigenvectors, 2015, ICML.

[25] Chi Jin et al. Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms, 2021, NeurIPS.

[26] Maria-Florina Balcan et al. An Improved Gap-Dependency Analysis of the Noisy Power Method, 2016, COLT.

[27] Cong Fang et al. Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks, 2020, COLT.

[28] Cho-Jui Hsieh et al. Convergence of Adversarial Training in Overparametrized Networks, 2019, ArXiv.

[29] Tengyu Ma et al. Label Noise SGD Provably Prefers Flat Global Minimizers, 2021, NeurIPS.

[30] S. Mendelson et al. Minimax rate of convergence and the performance of ERM in phase recovery, 2013, arXiv:1311.5024.

[31] Moritz Hardt et al. The Noisy Power Method: A Meta Algorithm with Applications, 2013, NIPS.

[32] Robert D. Kleinberg. Nearly Tight Bounds for the Continuum-Armed Bandit Problem, 2004, NIPS.

[33] Shachar Lovett et al. Bilinear Classes: A Structural Framework for Provable Generalization in RL, 2021, ICML.

[34] Wojciech Kotlowski et al. Bandit Principal Component Analysis, 2019, COLT.

[35] Zhiqiang Xu et al. Generalized phase retrieval: measurement number, matrix recovery and beyond, 2016, Applied and Computational Harmonic Analysis.

[36] Nello Cristianini et al. Finite-Time Analysis of Kernelised Contextual Bandits, 2013, UAI.

[37] Alessandro Lazaric et al. Learning Near Optimal Policies with Low Inherent Bellman Error, 2020, ICML.

[38] Sham M. Kakade et al. Few-Shot Learning via Learning the Representation, Provably, 2020, ICLR.

[39] Tor Lattimore et al. Bandit Phase Retrieval, 2021, ArXiv.

[40] Benjamin Van Roy et al. Eluder Dimension and the Sample Complexity of Optimistic Exploration, 2013, NIPS.

[41] Ambuj Tewari et al. Low-Rank Generalized Linear Bandit Problems, 2020, AISTATS.

[42] Liwei Wang et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[43] Vidyashankar Sivakumar et al. Structured Stochastic Linear Bandits, 2016, ArXiv.

[44] Emmanuel J. Candès et al. On the Fundamental Limits of Adaptive Sensing, 2011, IEEE Transactions on Information Theory.

[45] Cameron Musco et al. Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition, 2015, NIPS.

[46] Aditya Gopalan et al. Low-rank Bandits with Latent Mixtures, 2016, ArXiv.

[47] Yin Tat Lee et al. Kernel-based methods for bandit convex optimization, 2016, STOC.

[48] Colin Wei et al. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel, 2018, NeurIPS.

[49] Tor Lattimore et al. Online Sparse Reinforcement Learning, 2020, ArXiv.

[50] Raghu Meka et al. Learning Polynomials of Few Relevant Dimensions, 2020, COLT.

[51] Anima Anandkumar et al. Tensor decompositions for learning latent variable models, 2012, J. Mach. Learn. Res.

[52] Joan Bruna et al. On the Expressive Power of Deep Polynomial Neural Networks, 2019, NeurIPS.

[53] Yuanzhi Li et al. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data, 2018, NeurIPS.

[54] Csaba Szepesvári et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.

[55] Thomas P. Hayes et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.

[56] Xiaodong Li et al. Phase Retrieval via Wirtinger Flow: Theory and Algorithms, 2014, IEEE Transactions on Information Theory.

[57] Yuanzhi Li et al. An optimal algorithm for bandit convex optimization, 2016, ArXiv.

[58] Andrea Montanari et al. Linearized two-layers neural networks in high dimension, 2019, The Annals of Statistics.

[59] Xiaodong Li et al. Optimal Rates of Convergence for Noisy Sparse Phase Retrieval via Thresholded Wirtinger Flow, 2015, ArXiv.

[60] Sham M. Kakade et al. Stochastic Convex Optimization with Bandit Feedback, 2011, SIAM J. Optim.

[61] Jie Zhou et al. Low-rank Tensor Bandits, 2020, ArXiv.

[62] Yuanzhi Li et al. What Can ResNet Learn Efficiently, Going Beyond Kernels?, 2019, NeurIPS.

[63] Handong Zhao et al. Neural Contextual Bandits with Deep Representation and Shallow Exploration, 2020, ICLR.

[64] S. Kakade et al. Reinforcement Learning: Theory and Algorithms, 2019.

[65] Csaba Szepesvári et al. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits, 2012, AISTATS.

[66] Yu Bai et al. Towards Understanding Hierarchical Learning: Benefits of Neural Representations, 2020, NeurIPS.

[67] Jasper Snoek et al. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling, 2018, ICLR.

[68] Nathan Srebro et al. Kernel and Deep Regimes in Overparametrized Models, 2019, ArXiv.

[69] Anima Anandkumar et al. Online and Differentially-Private Tensor Decomposition, 2016, NIPS.

[70] Wei Hu et al. Provable Benefits of Representation Learning in Linear Bandits, 2020, ArXiv.