Re-evaluating evaluation

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry-picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper, we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation, since there is no harm (computational cost aside) in including all available tasks and agents.
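The redundancy claim can be illustrated on a toy agent-vs-agent example. The sketch below is not the paper's implementation: it assumes an antisymmetric payoff matrix (entry `[i][j]` is agent i's payoff against agent j), approximates a Nash equilibrium with Hedge (multiplicative weights) self-play rather than solving for the maximum-entropy equilibrium, and scores each agent by its expected payoff against that equilibrium mixture. The function names, the step count, and the rock-paper-scissors matrices are all hypothetical choices made for illustration.

```python
import math


def approx_nash(payoff, steps=20000):
    """Approximate a symmetric Nash equilibrium of an antisymmetric
    zero-sum game by time-averaging Hedge self-play (an assumption of
    this sketch, not a maximum-entropy solver)."""
    n = len(payoff)
    eta = math.sqrt(2.0 * math.log(n) / steps)  # standard Hedge step size
    p = [1.0 / n] * n       # current mixture over agents
    avg = [0.0] * n         # running time-average of the mixtures
    for _ in range(steps):
        # Expected payoff of each agent against the current mixture.
        gains = [sum(payoff[i][j] * p[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            avg[i] += p[i] / steps
        # Multiplicative-weights update, renormalised every step.
        w = [p[i] * math.exp(eta * gains[i]) for i in range(n)]
        total = sum(w)
        p = [x / total for x in w]
    return avg


def nash_average(payoff, steps=20000):
    """Score each agent by its payoff against the approximate equilibrium."""
    p = approx_nash(payoff, steps)
    n = len(payoff)
    return [sum(payoff[i][j] * p[j] for j in range(n)) for i in range(n)]


def uniform_average(payoff):
    """Score each agent by its mean payoff against all agents."""
    n = len(payoff)
    return [sum(row) / n for row in payoff]


# Agents: rock, paper, scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

# Agents: rock, a redundant copy of rock, paper, scissors.
RPS_DUP = [[0, 0, -1, 1],
           [0, 0, -1, 1],
           [1, 1, 0, -1],
           [-1, -1, 1, 0]]

if __name__ == "__main__":
    print(uniform_average(RPS_DUP))  # [0.0, 0.0, 0.25, -0.25]
    print(nash_average(RPS_DUP))     # every entry close to zero
```

With uniform averaging, adding the redundant copy of rock makes paper look stronger and scissors weaker, even though nothing about the three underlying strategies has changed. Scoring against the approximate equilibrium leaves all agents close to zero in both versions of the game, which is the kind of invariance to redundant agents that the abstract describes.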
