Multi-Model Comparison Using the Cross-Fitting Method

Holger Schultheis (schulth@informatik.uni-bremen.de)
Cognitive Systems, University of Bremen, Enrique-Schmidt-Str. 5, 28359 Bremen, Germany

Praneeth Naidu
Computer Science and Engineering, IIT Bombay, Mumbai 400076, India

Abstract

When comparing the ability of computational cognitive models to fit empirical data, the complexity of the compared models needs to be taken into account. A promising method for achieving this is the parametric bootstrap cross-fitting method (PBCM) proposed by Wagenmakers, Ratcliff, Gomez, and Iverson (2004). We contribute to a wider applicability of the PBCM in two ways: First, we compare the performance of the data-informed and the data-uninformed variant of the PBCM. Our simulations suggest that only the data-uninformed variant successfully controls for model complexity in model selection. Second, we propose an extension of the PBCM, called MMPBCM, that is applicable to, in principle, arbitrarily many competing models. We evaluate the MMPBCM by applying it to the comparison of several sets of competing models. The obtained results suggest that the MMPBCM constitutes a more powerful approach to model comparison than the PBCM.

Keywords: Model Evaluation, Multi-Model Comparison, Parametric Bootstrap Crossfitting Method.

Introduction

It is often considered an advantage of computational cognitive models that they allow generating data by simulation and this article concerns this type of data-generating models. One way to evaluate and compare such models is to generate data from them and to compare the model-generated data to empirical data pertinent to the phenomenon that is being modeled. The degree of correspondence between the model-generated and the empirical data is often called the goodness of fit (GOF) and it may be used to assess the quality of the competing models: The higher the GOF, the better the model.

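As a minimal sketch of this kind of GOF-based comparison, the following Python snippet scores two models against the same empirical data and picks the one with the higher GOF. The GOF measure used here (negative root-mean-square error), the function names gof and naive_select, and all data values are assumptions made purely for illustration; the text does not commit to a particular GOF measure.

```python
import math


def gof(model_data, empirical_data):
    """Goodness of fit as negative RMSE (an assumed measure).

    0 is a perfect fit; more negative values mean a worse fit,
    so a higher value always indicates a better fit.
    """
    if len(model_data) != len(empirical_data):
        raise ValueError("data sets must have the same length")
    squared = [(m - e) ** 2 for m, e in zip(model_data, empirical_data)]
    return -math.sqrt(sum(squared) / len(squared))


def naive_select(gof_by_model):
    """Return the name of the model with the highest GOF.

    Model complexity is ignored entirely.
    """
    return max(gof_by_model, key=gof_by_model.get)


# Hypothetical mean response times (ms) in four experimental conditions.
empirical = [520.0, 560.0, 610.0, 680.0]

# Hypothetical data generated by two competing models, A and B.
generated = {
    "A": [530.0, 550.0, 600.0, 690.0],
    "B": [500.0, 580.0, 640.0, 650.0],
}

scores = {name: gof(data, empirical) for name, data in generated.items()}
print(scores)
print(naive_select(scores))
```

The selection step looks only at the GOF values themselves; nothing in it reflects how flexible each model is.
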
However, such a naïve use of GOF measures for model comparison is problematic, because it neglects model complexity. Due to overfitting, more complex models may provide high GOF measures solely by virtue of their complexity. As a result, the naïve use of GOF measures may lead to the selection of a more complex model even if a less complex model actually provides a better approximation to the processes that underlie the phenomenon that is being investigated (Pitt & Myung, 2002).

To address this problem, a number of methods have been proposed that take model complexity into account when comparing how well models can account for empirical data (see Shiffrin, Lee, Kim, & Wagenmakers, 2008; Schultheis, Singhaniya, & Chaplot, 2013, for overviews). One of these methods is the parametric bootstrap cross-fitting method (PBCM) proposed by Wagenmakers et al. (2004). Two properties of the PBCM render it particularly appealing for model evaluation and selection: First, the PBCM is applicable to any type of model, since it imposes no constraints on the modeling paradigm or the models' structure. Second, if one of the compared models captures the actual processes that generated the to-be-fitted data, the PBCM has been considered to perform optimally in selecting this model (Shiffrin et al., 2008; Cohen, Sanborn, & Shiffrin, 2008).

Given these properties, employment of the PBCM instead of the naïve use of GOF measures seems highly desirable. At the same time, two aspects of the PBCM, as so far discussed in the literature, may hamper or even preclude use of the PBCM in certain modeling situations. For one, in the article introducing the PBCM, Wagenmakers et al. (2004) propose two different variants of the PBCM called the data-informed PBCM (DIPBCM) and the data-uninformed PBCM (DUPBCM).
Since these two variants differ considerably in their computational complexity, it would be important to know to what extent their performance in model comparison differs. Initial analyses presented in Wagenmakers et al. (2004) suggest that the DIPBCM may generally perform worse than the DUPBCM, but information that would allow quantifying potential differences between the two variants in more detail is currently not available from the literature on the PBCM (Wagenmakers et al., 2004; Cohen, Sanborn, & Shiffrin, 2008; Cohen, Rotello, & MacMillan, 2008; Jang, Wixted, & Huber, 2011; Perea, Gomez, & Fraga, 2010). Furthermore, both PBCM variants are currently restricted to the comparison of pairs of models. When more than two competing models need to be compared, this comparison must be broken down into multiple comparisons of model pairs or the PBCM cannot be applied at all.

In this article, we provide a first systematic quantitative comparison of the DIPBCM and the DUPBCM regarding their model selection performance. We also propose and evaluate an extension of the PBCM that allows comparing more than two competing models. Both contributions facilitate the use of the PBCM and thus, more generally, should help increase the frequency with which more sophisticated comparison methods are employed for model evaluation and comparison in place of the naïve approach.

The PBCM

Let A and B be two competing models and x a set of observed data (e.g., response times from different experimental conditions). Furthermore, let $\Delta\mathit{gof}^{x}_{AB}$ be the GOF difference of the two models on the data set x, that is, $\Delta\mathit{gof}^{x}_{AB} = \mathit{gof}^{x}_{B} - \mathit{gof}^{x}_{A}$, where $\mathit{gof}^{x}_{A}$ and $\mathit{gof}^{x}_{B}$ are the goodness of fits the models A and B achieve on x, respectively. A naïve approach to model comparison would select A if $\Delta\mathit{gof}^{x}_{AB} \geq 0$ and B otherwise.

The PBCM aims to improve on the naïve approach by tak-
[1] Ian H. Witten, et al. The WEKA data mining software: an update. 2009, SKDD.
[2] R. Tibshirani, et al. Improvements on Cross-Validation: The 632+ Bootstrap Method. 1997.
[3] Holger Schultheis. Decision Criteria for Model Comparison Using Cross-Fitting. 2013.
[4] C. Rotello, et al. Evaluating models of remember-know judgments: Complexity, mimicry, and discriminability. 2008, Psychonomic bulletin & review.
[5] I. J. Myung, et al. When a good fit can be bad. 2002, Trends in Cognitive Sciences.
[6] Michael D. Lee, et al. A Survey of Model Evaluation Approaches With a Tutorial on Hierarchical Bayesian Methods. 2008, Cogn. Sci.
[7] Isabel Fraga, et al. In Masked Nonword Repetition Effects in Yes/no and Go/no-go Lexical Decision: a Test of the Evidence Accumulation and Deadline Accounts. 2022.
[8] J. Wixted, et al. The diagnosticity of individual data for model selection: Comparing signal-detection models of recognition memory. 2011, Psychonomic bulletin & review.
[9] Holger Schultheis, et al. Comparing Model Comparison Methods. 2013, CogSci.
[10] Adam N. Sanborn, et al. Model evaluation using grouped or individual data. 2008, Psychonomic bulletin & review.
[11] Roger Ratcliff, et al. Assessing model mimicry using the parametric bootstrap. 2004.