Sequential Decision Making with Combinatorial Actions and High-Dimensional Contexts

In interactive sequential decision-making systems, the learning agent must react to new information in both the short term and the long term, and must learn to generalize through repeated interactions with the environment. Unlike in offline learning, the data that arrives is typically a function of the agent's previous actions. A key challenge is to efficiently use and generalize from data that may never reappear. Furthermore, in many real-world applications the agent receives only partial feedback on the decisions it makes. This necessitates a balanced exploration-exploitation approach: the agent must efficiently collect relevant information to prepare for future feedback, while producing desired outcomes in the current period by exploiting the information already collected. In this thesis, we focus on two classes of fundamental sequential learning problems.

Contextual bandits with combinatorial actions and user choice (Chapters 2 and 3): We investigate the dynamic assortment selection problem by combining statistical estimation of choice models with generalization through contextual information. For this problem, we design and analyze both UCB and Thompson sampling algorithms with rigorous performance guarantees and computational tractability; a sketch of one Thompson sampling round appears after this overview.

High-dimensional contextual bandits (Chapter 4): We investigate policies that can efficiently exploit structure in high-dimensional data, e.g., sparsity. We design and analyze an efficient sparse contextual bandit algorithm that does not require knowledge of the sparsity level of the underlying parameter, information that essentially all existing sparse bandit algorithms require (a sketch of one Lasso-based round is also given below).
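To make the first problem class concrete, here is a minimal sketch of one Thompson sampling round for an MNL assortment bandit with linear utilities. The Gaussian posterior approximation, the function names (ts_mnl_step, best_revenue_ordered_assortment), and the unconstrained revenue-ordered optimization are illustrative assumptions, not the exact algorithms analyzed in Chapters 2 and 3.

```python
import numpy as np

def mnl_expected_revenue(revenues, utilities):
    # Expected revenue of an assortment under the MNL choice model:
    # R(S) = sum_i r_i * v_i / (1 + sum_i v_i), where the no-purchase
    # option carries utility 1.
    return float(np.dot(revenues, utilities) / (1.0 + utilities.sum()))

def best_revenue_ordered_assortment(revenues, utilities):
    # Without a capacity constraint, an optimal MNL assortment is
    # revenue-ordered, so it suffices to scan the nested candidates.
    order = np.argsort(-revenues)
    best_items, best_rev = order[:1], 0.0
    for k in range(1, len(revenues) + 1):
        items = order[:k]
        rev = mnl_expected_revenue(revenues[items], utilities[items])
        if rev > best_rev:
            best_items, best_rev = items, rev
    return best_items, best_rev

def ts_mnl_step(X, revenues, post_mean, post_cov, rng):
    # One Thompson sampling round: draw theta from an (assumed) Gaussian
    # posterior, set MNL utilities v_i = exp(x_i' theta), and offer the
    # assortment that is optimal for the sampled parameter.
    theta = rng.multivariate_normal(post_mean, post_cov)
    utilities = np.exp(X @ theta)
    return best_revenue_ordered_assortment(revenues, utilities)
```

In a full algorithm the posterior would be updated from the observed purchase/no-purchase feedback (e.g., via epoch-based updates as in the MNL-bandit literature); here it is taken as given.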

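For the second problem class, the following sketch shows one round of a hypothetical Lasso-based contextual bandit whose regularization schedule shrinks with the sample size and never uses the true sparsity level, in the spirit of the Chapter 4 result. The names (sparse_bandit_step, lam0) and the greedy action rule are assumptions for illustration, not the analyzed policy.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_bandit_step(contexts, past_X, past_r, t, lam0=1.0, rng=None):
    # One round of a (hypothetical) sparsity-agnostic Lasso bandit.
    # contexts: K x d matrix of candidate-arm features for this round;
    # (past_X, past_r): features and rewards of arms played so far.
    rng = rng if rng is not None else np.random.default_rng()
    if len(past_r) == 0:
        # No data yet: play an arbitrary arm.
        return int(rng.integers(contexts.shape[0]))
    # Regularization decays like sqrt(log d / t) and does not depend on
    # the unknown sparsity s of the reward parameter.
    lam = lam0 * np.sqrt(np.log(contexts.shape[1]) / t)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(past_X, past_r).coef_
    # Act greedily on the Lasso estimate; exploration-free variants rely
    # on context diversity, while others add forced sampling.
    return int(np.argmax(contexts @ theta_hat))
```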