The Evaluation of Rating Systems in Online Free-for-All Games

Online competitive games have become increasingly popular. To keep play exciting and competitive, these games routinely attempt to match players of similar skill, usually by means of a rating system. A growing body of research addresses the development of such rating systems, but far less attention has been given to the metrics used to evaluate them. In this paper, we present an exhaustive analysis of six metrics for evaluating rating systems in online competitive games. We compare traditional metrics such as accuracy with metrics adapted from the field of information retrieval. We evaluate these metrics against several well-known rating systems on a large real-world dataset of over 100,000 free-for-all matches. Our results show stark differences in their utility. Some metrics do not consider deviations between two ranks. Others are disproportionately affected by new players. Many do not capture the importance of distinguishing between errors in higher ranks and errors in lower ranks. Among all metrics studied, we recommend Normalized Discounted Cumulative Gain (NDCG): it resolves the issues faced by the other metrics, and it offers the flexibility to adjust the evaluation to the goals of the system.
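To make the recommendation concrete, the following is a minimal sketch of how NDCG could score one predicted finishing order against the observed result of a free-for-all match. The gain function (a linear gain of n for the winner down to 1 for last place), the log2 position discount, and the names `dcg` and `ndcg` are illustrative assumptions, not necessarily the formulation used in the paper.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1),
    with positions counted from 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(predicted_order, observed_rank):
    """NDCG of a predicted finishing order for a single match.

    predicted_order: player ids, best predicted player first.
    observed_rank:   dict mapping player id -> observed rank (1 = winner).
    """
    n = len(predicted_order)
    # Assumed gain: the winner is worth n, last place is worth 1.
    gains = [n - observed_rank[p] + 1 for p in predicted_order]
    ideal = sorted(gains, reverse=True)  # gains under a perfect prediction
    return dcg(gains) / dcg(ideal)


observed = {"a": 1, "b": 2, "c": 3, "d": 4}
perfect = ndcg(["a", "b", "c", "d"], observed)
top_swap = ndcg(["b", "a", "c", "d"], observed)     # error among the top ranks
bottom_swap = ndcg(["a", "b", "d", "c"], observed)  # error among the bottom ranks
print(perfect, top_swap, bottom_swap)
```

Under these assumptions, a swap between the top two players lowers the score more than a swap between the bottom two, which is the distinction between errors in higher and lower ranks that the abstract refers to. Changing the gain or the discount is one way to adjust the evaluation to the goals of the system.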
