Scaling of True and Apparent ROC AUC with Number of Observations and Number of Variables

ABSTRACT New technologies have recently emerged that enable simultaneous evaluation of large numbers of biological markers. The resulting marker data are often used to build predictive models that claim to distinguish between two or more classes of subjects. However, when the number of variables is large and the number of observations is small, overfitting arises: the model parameters are optimized for the observed data but may fit independent data poorly. Here we illustrate how various quantities related to true and apparent predictive ability scale with the number of markers and the number of observations (subjects). Specifically, we use a model that takes the form of a linear combination of a subset of marker variables; the model produces a propensity score, which generates an ROC curve and a corresponding area under the ROC curve (AUC), a measure of predictive ability. Given the true marker distributions, there is a parameter value so that the resulting...