Stacked calibration of off-policy policy evaluation for video game matchmaking

We consider an industrial-strength application of recommendation systems to video-game matchmaking in which off-policy policy evaluation is important but standard approaches can hardly be applied. The policy's objective is to sequentially form teams from the players waiting to be matched, so as to produce well-balanced matches. Unfortunately, the available training data come from a policy that is not known perfectly and is not stochastic, which makes methods based on importance weights impossible to use. Furthermore, we observe that when the estimated reward function and the policy are both trained on the same off-policy dataset, policy evaluation using the estimated reward function is biased. We present a simple calibration procedure, similar to stacked regression, that removes most of this bias in the experiments we performed. The experiments use data collected during beta tests of Ghost Recon Online, a first-person shooter from Ubisoft.
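The bias described in the abstract, and the general idea of a stacking-style correction, can be illustrated with a small synthetic sketch. This is not the paper's procedure or data: the linear reward, the least-squares model, the greedy policy, the 5-fold split, and every name and constant below are assumptions chosen only to make the effect visible. A reward model is fit on logged data, a greedy policy picks the candidate with the highest predicted reward, and the naive value estimate (the mean predicted reward of the chosen candidates) is compared with a calibrated one obtained by regressing observed rewards on out-of-fold predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_folds = 1000, 100, 5   # logged samples, features, folds (hypothetical)
noise_sd = 2.0                 # reward noise level (hypothetical)

# Synthetic "logged" data: linear true reward plus noise.
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true + noise_sd * rng.normal(size=n)

def fit(features, rewards):
    """Ordinary least squares reward model (stand-in for any regressor)."""
    return np.linalg.lstsq(features, rewards, rcond=None)[0]

# Reward model trained on all logged data; the policy is greedy w.r.t. it,
# so model and policy come from the same dataset.
w_hat = fit(X, y)

# Evaluation contexts, each offering 10 candidate actions (feature vectors).
cands = rng.normal(size=(2000, 10, d))
pred = cands @ w_hat                        # predicted rewards, shape (2000, 10)
rows = np.arange(len(cands))
chosen = pred.argmax(axis=1)                # greedy choice per context
pred_chosen = pred[rows, chosen]

naive = pred_chosen.mean()                  # model-based estimate, optimistic
true_value = (cands @ w_true)[rows, chosen].mean()  # known only in simulation

# Stacking-style calibration: collect out-of-fold predictions on the logged
# data, then fit a linear map from those predictions to observed rewards.
folds = np.arange(n) % n_folds
oof = np.empty(n)
for k in range(n_folds):
    held = folds == k
    oof[held] = X[held] @ fit(X[~held], y[~held])
slope, intercept = np.polyfit(oof, y, deg=1)

calibrated = (intercept + slope * pred_chosen).mean()

print(f"true value {true_value:.3f}  naive {naive:.3f}  calibrated {calibrated:.3f}")
```

In this toy setting the calibration slope comes out below 1, shrinking the overconfident predictions of the overfit model. The out-of-fold models see slightly less data than the full model, so the corrected estimate tends to be somewhat conservative.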
