The Problem of Overfitting

Model fitting is an important part of all sciences that use quantitative measurements. Experimenters often explore the relationships between measures. Two subclasses of relationship problems are as follows:

• Correlation problems: those in which we have a collection of measures, all of interest in their own right, and wish to see how, and how strongly, they are related.

• Regression problems: those in which one of the measures, the dependent variable, is of special interest, and we wish to explore its relationship with the other variables. These other variables may be called the independent variables, the predictor variables, or the covariates. The dependent variable may be a continuous numeric measure, such as a boiling point, or a categorical measure, such as a classification into mutagenic and nonmutagenic.

We should emphasize that using the terms ‘correlation problem’ and ‘regression problem’ is not meant to tie these problems to any particular statistical methodology. Having a ‘correlation problem’ does not limit us to conventional Pearson correlation coefficients. Log-linear models, for example, measure the relationship between categorical variables in multiway contingency tables. Similarly, multiple linear regression is a methodology useful for regression problems, but so also are nonlinear regression, neural nets, recursive partitioning, k-nearest neighbors, logistic regression, support vector machines, and discriminant analysis, to mention a few. All of these methods aim to quantify the relationship between the predictors and the dependent variable. We will use the term ‘regression problem’ in this conceptual form and, when we want to specialize to multiple linear regression using ordinary least squares, will describe it as ‘OLS regression’.

Our focus is on regression problems. We will use y as shorthand for the dependent variable and x for the collection of predictors available. There are two distinct primary settings in which we might want to do a regression study:

• Prediction problems: We may want to make predictions of y for future cases where we know x but do not know y. This, for example, is the problem faced with the Toxic Substances Control Act (TSCA) list. This list contains many tens of thousands of compounds, and there is a need to identify those on the list that are potentially harmful. Only a small fraction of the list, however, has any measured biological properties, but all of the compounds can be characterized by chemical descriptors with relative ease. Using quantitative structure-activity relationships (QSARs) fitted to this small fraction to predict the toxicities of the much larger collection is a potentially cost-effective way to sort the TSCA compounds by their potential for harm. Later, we will use a data set for predicting the boiling point of a set of compounds on the TSCA list from some molecular descriptors.

• Effect quantification: We may want to gain an understanding of how the predictors enter into the relationship that predicts y. We do not necessarily have candidate future unknowns that we want to predict; we simply want to know how each predictor drives the distribution of y. This is the setting seen in drug discovery, where the biological activity y of each compound in a collection is measured, along with molecular descriptors x. Finding out which descriptors are associated with high and which with low biological activity leads to a recipe for new compounds: ones that are high in the features associated positively with activity and low in those associated with inactivity or with adverse side effects.

These two objectives are not always best served by the same approaches. ‘Feature selection’, keeping those features associated with y and discarding those that are not, is very commonly a part of an analysis meant for effect quantification but is not necessarily helpful if the objective is prediction of future unknowns. For prediction, methods such as partial least squares (PLS) and ridge regression (RR) that retain all features but rein in their contributions are often found to be more effective than those relying on feature selection.
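To make the contrast concrete, here is a minimal sketch (not from the paper) using scikit-learn on synthetic data. It compares a simple feature-selection pipeline, SelectKBest followed by OLS, against ridge regression, which retains every descriptor but shrinks its coefficient. The descriptor matrix X, response y, and the tuning values (k=5, alpha=10) are illustrative assumptions, not choices from the paper.

```python
# A minimal sketch: feature selection + OLS versus ridge regression,
# which keeps all descriptors but reins in their contributions.
# X and y are synthetic stand-ins for a descriptor matrix and a
# measured property; all settings here are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 60, 30                      # few cases, many descriptors
X = rng.normal(size=(n, p))
y = X[:, :10] @ rng.normal(size=10) + rng.normal(scale=2.0, size=n)

# Keep only the 5 descriptors most associated with y, then fit OLS.
select_then_ols = make_pipeline(SelectKBest(f_regression, k=5),
                                LinearRegression())
# Retain all descriptors but shrink their coefficients.
ridge = Ridge(alpha=10.0)

for name, model in [("select+OLS", select_then_ols), ("ridge", ridge)]:
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name:12s} cross-validated RMSE: {-scores.mean():.2f}")
```

On data like these, the shrinkage approach typically predicts held-out cases at least as well as the selection pipeline, echoing the point above; which method wins on any real data set depends, of course, on that data set.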
What Is Overfitting?

Occam’s Razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more. For example, if a regression model with two predictors is enough to explain y, then no more than these two predictors should be used. Going further, if the relationship can be captured by a linear function in these two predictors (which is described by three numbers: the intercept and two slopes), then using a quadratic violates parsimony.

Overfitting is the use of models or procedures that violate parsimony, that is, that include more terms than are necessary or use more complicated approaches than are necessary. It is helpful to distinguish two types of overfitting:

• Using a model that is more flexible than it needs to be. For example, a neural net is able to accommodate some curvilinear relationships and so is more flexible than a simple linear regression. But if it is used on a data set that conforms to the linear model, it will add a level of complexity without any corresponding benefit in performance.
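This first kind of overfitting is easy to demonstrate. Below is a minimal sketch on synthetic data, in which a high-degree polynomial stands in for the overly flexible model (the paper’s example is a neural net). The data are generated from a genuinely linear relationship, so the extra flexibility buys a slightly better training fit and nothing more; all data and settings are illustrative assumptions.

```python
# A minimal illustration of overfitting by excess flexibility: a
# degree-9 polynomial (standing in for any overly flexible model)
# fitted to data that genuinely follow a straight line.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=200)  # truly linear

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    rmse_tr = mean_squared_error(y_tr, model.predict(x_tr)) ** 0.5
    rmse_te = mean_squared_error(y_te, model.predict(x_te)) ** 0.5
    # The degree-9 fit tracks the training data a little more closely
    # but typically predicts the held-out cases no better, and often
    # worse, than the straight line.
    print(f"degree {degree}: train RMSE {rmse_tr:.2f}, test RMSE {rmse_te:.2f}")
```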