Dual Indicators to Analyze AI Benchmarks: Difficulty, Discrimination, Ability, and Generality

To better analyze the results of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. We illustrate how these key indicators give more insight into the results of two popular AI benchmarks, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition, and we include guidelines for estimating and interpreting these indicators for other AI benchmarks and competitions.
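To make the three IRT-derived indicators concrete, the following minimal sketch evaluates an item characteristic curve. It assumes the standard two-parameter logistic (2PL) model from the IRT literature; the abstract does not state which IRT model is used, and the function name `p_success` and all parameter values are hypothetical, chosen only for illustration.

```python
import math


def p_success(ability: float, difficulty: float, discrimination: float) -> float:
    """Probability that a system with the given ability solves a problem.

    Assumed 2PL logistic curve: difficulty shifts the curve along the
    ability axis, and discrimination sets its slope at that point.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical problems: one easy and highly discriminating, one hard and flat.
easy_sharp = {"difficulty": -1.0, "discrimination": 2.0}
hard_flat = {"difficulty": 1.5, "discrimination": 0.5}

for ability in (-2.0, 0.0, 2.0):
    print(
        f"ability={ability:+.1f}  "
        f"easy/sharp={p_success(ability, **easy_sharp):.3f}  "
        f"hard/flat={p_success(ability, **hard_flat):.3f}"
    )
```

Under this assumed model, a system whose ability equals a problem's difficulty succeeds with probability 0.5, and a higher discrimination makes the success probability change more sharply around that point, which is what lets a problem separate weaker systems from stronger ones.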
