Comparing Elo, Glicko, IRT, and Bayesian IRT Statistical Models for Educational and Gaming Data

Statistical models for estimating skill or ability often vary by field, yet their underlying mathematics can be very similar. Where the models differ, it is often because they must accommodate data with different formats and structures. As models from different fields grow more complex, the range of data types to which they can be applied may also grow. Models applied to educational or psychological data have advanced to accommodate a wide range of data formats and to improve estimation accuracy with sparsely populated data matrices. Meanwhile, online gaming has expanded over the last two decades to use more complex statistical models that provide real-time matchmaking based on ability estimates. Comparing statistical models from the educational and gaming fields is useful because different datasets may benefit from different ability estimation procedures. This study compared models typically used in game matchmaking systems (Elo, Glicko) with models used in psychometrics (item response theory (IRT) and Bayesian IRT), using both simulated and real data under a variety of conditions. Results indicated that, in conditions with small numbers of items or matches, the Bayesian IRT one-parameter logistic (1PL) model produced the most accurate skill estimates, regardless of whether educational or gaming data were used. This held for all sample sizes with small numbers of items. However, the Elo and non-Bayesian IRT 1PL estimates were close to those of the Bayesian IRT 1PL model for both gaming and educational data. While the two-parameter logistic (2PL) models were not accurate in the gaming study conditions, the IRT 2PL and Bayesian IRT 2PL models outperformed the 1PL models when 2PL educational data were generated under the larger sample size and item condition. Overall, the Bayesian IRT 1PL model appeared to be the best choice across the smaller sample and match size conditions.
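
The similarity between the gaming and psychometric models can be illustrated with a minimal sketch. It assumes the textbook forms of the Elo expected-score formula (base 10, scale 400) and the 1PL item response function; the function names, the example ratings, and the rescaling constant are illustrative choices, not taken from the study. Under those assumptions, an Elo rating difference maps onto a 1PL ability-minus-difficulty difference by a constant factor, and the two models give the same win or success probability.

```python
import math

def elo_expected(r_a, r_b):
    """Expected score of player A against player B (standard Elo form)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def irt_1pl(theta, b):
    """Probability of a correct response under the 1PL model,
    for ability theta and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Assumed conversion: one Elo point equals ln(10)/400 logits,
# because 10**(-d/400) == exp(-d * ln(10) / 400).
ELO_TO_LOGIT = math.log(10) / 400.0

# Hypothetical ratings: the "player" plays the role of the examinee,
# the "opponent" plays the role of the item.
player, opponent = 1600.0, 1400.0

p_elo = elo_expected(player, opponent)
p_irt = irt_1pl(player * ELO_TO_LOGIT, opponent * ELO_TO_LOGIT)

print(f"Elo expected score: {p_elo:.6f}")
print(f"1PL probability:    {p_irt:.6f}")
```

The two printed values agree up to floating-point error. The sketch covers only the shared probability function; the models compared in the study differ in how they estimate and update the parameters, which is not shown here.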
