Partial Monitoring - Classification, Regret Bounds, and Algorithms

In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize its regret, defined as the difference between its total cumulative loss and the total loss of the best fixed action in hindsight. In this paper we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes: the minimax regret of any such game is either zero or scales as T^{1/2}, T^{2/3}, or T, up to constants and logarithmic factors. We provide computationally efficient learning algorithms that achieve the minimax regret within a logarithmic factor for any game. Beyond these minimax bounds, when the outcomes are generated in an i.i.d. fashion we also prove individual upper bounds on the expected regret.
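The protocol above can be sketched in a few lines of Python. This is an illustrative simulation only: the loss matrix L, the feedback matrix H, and the naive uniform-random learner are hypothetical examples chosen for clarity, not the algorithms analyzed in the paper. The key feature of partial monitoring is that the learner observes only the feedback signal, never the loss it actually suffers.

```python
import random

# Hypothetical example game: entry [i][j] gives the loss / feedback signal
# when the learner plays action i and the environment draws outcome j.
L = [[0.0, 1.0],
     [1.0, 0.0]]
H = [[0, 1],   # action 0 reveals the outcome ...
     [0, 0]]   # ... action 1 reveals nothing

random.seed(0)
T = 1000
n_actions, n_outcomes = len(L), len(L[0])
cum_loss = 0.0
outcome_counts = [0] * n_outcomes

for t in range(T):
    action = random.randrange(n_actions)     # a naive uniform learner
    outcome = random.randrange(n_outcomes)   # i.i.d. environment
    cum_loss += L[action][outcome]           # suffered, but NOT observed
    signal = H[action][outcome]              # the only information received
    outcome_counts[outcome] += 1

# Regret: cumulative loss minus the loss of the best fixed action in hindsight.
best_fixed = min(sum(L[a][j] * outcome_counts[j] for j in range(n_outcomes))
                 for a in range(n_actions))
regret = cum_loss - best_fixed
print(f"regret after {T} rounds: {regret:.1f}")
```

A learner that actually exploits the signals would update its action distribution from `signal` alone; a uniform learner like the one above suffers regret growing linearly in T on games where the actions have different expected losses.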
