Scalable Generalized Linear Bandits: Online Computation and Hashing

Generalized Linear Bandits (GLBs), a natural extension of stochastic linear bandits, have been popular and successful in recent years. However, existing GLBs scale poorly with the number of rounds and the number of arms, limiting their utility in practice. This paper proposes new, scalable solutions to the GLB problem in two respects. First, unlike existing GLBs, whose per-time-step space and time complexity grow at least linearly with time $t$, we propose a new algorithm that performs online computations and thus enjoys constant space and time complexity. At its heart is a novel Generalized Linear extension of the Online-to-confidence-set Conversion (the GLOC method) that turns \emph{any} online learning algorithm into a GLB algorithm. As a special case, we apply GLOC to the online Newton step algorithm, which results in a low-regret GLB algorithm with much lower time and memory complexity than prior work. Second, for the case where the number $N$ of arms is very large, we propose new algorithms in which each next arm is selected via an inner product search. Such methods can be implemented via hashing algorithms (i.e., they are "hash-amenable") and achieve a time complexity sublinear in $N$. While a Thompson sampling extension of GLOC is hash-amenable, its regret bound for $d$-dimensional arm sets scales with $d^{3/2}$, whereas GLOC's regret bound scales with $d$. Towards closing this gap, we propose a new hash-amenable algorithm whose regret bound scales with $d^{5/4}$. Finally, we propose a fast approximate hash-key computation (inner product) with better accuracy than the state of the art, which may be of independent interest. We conclude with preliminary experimental results confirming the merits of our methods.
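
To make the computational claims concrete, below is a minimal Python sketch of the two building blocks the abstract refers to: the online Newton step update (the online learner that GLOC is instantiated with in its low-complexity special case) and arm selection posed as a maximum inner product search. This is an illustrative sketch under assumed logistic rewards, not the paper's GLOC algorithm: the confidence-set construction and optimistic selection rule are omitted, the generalized projection is simplified to a Euclidean one, and the names `OnlineNewtonStep` and `select_arm` are ours. The hash-amenable variants would replace the brute-force argmax with a sublinear-time LSH lookup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlineNewtonStep:
    """Online Newton Step (Hazan et al., 2006) for logistic losses.

    Illustrative sketch only: GLOC wraps an online learner like this one
    with a confidence-set conversion, which is not shown here.
    """

    def __init__(self, d, gamma=0.1, eps=1.0, radius=1.0):
        self.theta = np.zeros(d)              # current parameter estimate
        self.Ainv = (1.0 / eps) * np.eye(d)   # inverse curvature matrix A_t^{-1}
        self.gamma = gamma                    # ONS step-size parameter
        self.radius = radius                  # radius of the feasible ball

    def update(self, x, y):
        """One O(d^2) update from arm features x and binary reward y;
        cost is constant per round, independent of t."""
        g = (sigmoid(x @ self.theta) - y) * x           # logistic-loss gradient
        Ag = self.Ainv @ g
        self.Ainv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)  # Sherman-Morrison rank-1 update
        self.theta -= (1.0 / self.gamma) * (self.Ainv @ g)
        # simplified Euclidean projection back onto the ball (ONS proper
        # uses a generalized projection in the A_t-norm)
        norm = np.linalg.norm(self.theta)
        if norm > self.radius:
            self.theta *= self.radius / norm

def select_arm(arms, query):
    """Arm selection as maximum inner product search (MIPS).

    Brute-force O(N) argmax stand-in; a hash-amenable algorithm answers
    the same query in time sublinear in N via locality-sensitive hashing.
    """
    return int(np.argmax(arms @ query))

# Toy usage: a few rounds of greedy play against a random logistic model.
rng = np.random.default_rng(0)
d, N = 5, 100
arms = rng.normal(size=(N, d)) / np.sqrt(d)
theta_star = rng.normal(size=d)
learner = OnlineNewtonStep(d)
for t in range(10):
    k = select_arm(arms, learner.theta)     # greedy stand-in for the real rule
    y = rng.binomial(1, sigmoid(arms[k] @ theta_star))
    learner.update(arms[k], y)
```

The key design point the sketch mirrors is that the per-round work (one rank-1 inverse update plus one inner product query) does not grow with $t$, in contrast to GLB algorithms that refit on the full history each round.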
