Online Learning with Gaussian Payoffs and Side Observations

We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). The setup allows a more refined information transfer from one action to another than previous partial monitoring setups, including the recently introduced graph-structured feedback case. For the first time in the literature, we provide non-asymptotic problem-dependent lower bounds on the regret of any algorithm, which recover existing asymptotic problem-dependent lower bounds and finite-time minimax lower bounds available in the literature. We also provide algorithms that achieve the problem-dependent lower bound (up to some universal constant factor) or the minimax lower bounds (up to logarithmic factors).
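The feedback model described in the abstract can be illustrated with a small simulation. This is only a minimal sketch under assumed names: `observe`, `means`, and `sigma` are hypothetical and do not come from the paper, and the numeric values are invented for illustration. Playing action `i` yields, for every action `j`, a Gaussian sample centred on the mean payoff of `j` with standard deviation `sigma[i][j]`; an infinite entry is modelled here as "no observation".

```python
import math
import random


def observe(action, means, sigma, rng):
    """Return one round of side observations after playing `action`.

    means[j]    : mean payoff of action j (unknown to the learner)
    sigma[i][j] : standard deviation of the observation about j when i is
                  played; math.inf stands for infinite variance
    """
    feedback = []
    for j, mean_j in enumerate(means):
        s = sigma[action][j]
        if math.isinf(s):
            # Infinite variance carries no information, so nothing is returned.
            feedback.append(None)
        else:
            feedback.append(rng.gauss(mean_j, s))
    return feedback


# Illustrative values only (assumptions, not taken from the paper).
means = [0.5, 0.3, 0.1]
sigma = [
    [1.0, 2.0, math.inf],       # action 0: noisy view of action 1, none of action 2
    [math.inf, 1.0, math.inf],  # action 1: bandit-style, observes only itself
    [0.5, 0.5, 1.0],            # action 2: informative about every action
]

rng = random.Random(0)
feedback = observe(0, means, sigma, rng)
```

In this sketch, a matrix with finite entries only on the diagonal corresponds to ordinary bandit feedback, while a matrix with all entries finite corresponds to observing every action in each round. Graph-structured feedback sits in between, with a finite entry wherever the graph has an edge.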

Yifan Wu | András György | Csaba Szepesvári
