Making Sense of Item Response Theory in Machine Learning

Item response theory (IRT) is widely used to measure the latent abilities of subjects (especially in educational testing) based on their responses to items of varying difficulty. The adaptation of IRT has recently been suggested as a novel perspective for a better understanding of the results of machine learning experiments and, by extension, other artificial intelligence experiments. For instance, IRT maps naturally onto classification tasks, where instances correspond to items and classifiers correspond to subjects. By adopting IRT, item (i.e., instance) characteristic curves can be estimated using logistic models, in which several parameters characterise each dataset instance: difficulty, discrimination and guessing. IRT looks promising for the analysis of instance hardness, noise, classifier dominance, etc. However, some caveats have been found when trying to interpret the IRT parameters in a machine learning setting, especially when we include some artificial classifiers in the pool of classifiers to be evaluated: the optimal and pessimal classifiers, a random classifier, and the majority and minority classifiers. In this paper we perform a series of experiments with a range of datasets and classification methods to fully understand how IRT works and what its parameters really mean in the context of machine learning. This better understanding will hopefully pave the way for a wide range of potential applications in machine learning and artificial intelligence.
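As a point of reference for the three parameters mentioned above, the standard three-parameter logistic (3PL) model in IRT gives the item characteristic curve of an instance $i$ as a function of a subject's (here, a classifier's) ability $\theta$; the notation below is illustrative and the paper's exact parameterisation may differ:
$$
P_i(\theta) \;=\; c_i \;+\; \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}},
$$
where $a_i$ is the discrimination, $b_i$ the difficulty and $c_i$ the guessing parameter of instance $i$, so that $P_i(\theta)$ is the probability that a classifier of ability $\theta$ labels instance $i$ correctly.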