Factors that Influence the Value of the Coefficient of Determination in Simple Linear and Nonlinear Regression Models
Cornell, J. A., and Berger, R. D. 1987. Factors that influence the value of the coefficient of determination in simple linear and nonlinear regression models. Phytopathology 77:63-70.

In the fitting of linear regression equations, the coefficient of determination ($R^2$) is one of the most widely used statistics to assess the goodness-of-fit of the equation. Its value, however, is affected by several factors, some of which are associated more closely with the data collection scheme or the experimental design than with how close the regression equation actually fits the observations. These design factors are: the range of values of the independent variable ($X$), the arrangement of $X$ values within the range, the number of replicate observations (≥1), and the variation among the $Y$ values at each value of $X$. Another little-known fact is the effect on $R^2$ of the ratio of the slope of the fitted equation to the estimated standard error of the observations. In nonlinear model fitting, the value of $R^2$ is best determined by calculating the proportion of the total variation in the observations that cannot be explained by the fitted model and subtracting this proportion from one. Several statistics that are analogous to the standard formula for $R^2$ in the linear regression case are given and determined to be inappropriate in the nonlinear case. The use of $R^2$ alone as a model-fitting criterion is often risky, and other statistics should be used to assess the goodness of the model when responses from quantitative treatments are analyzed by regression techniques.

Additional key words: coefficient of determination, residuals, standard error.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. § 1734 solely to indicate this fact.

©1987 The American Phytopathological Society. Vol. 77, No. 1, 1987.

Linear regression is a commonly used statistical analysis in plant pathology. It has been used, for example, to determine inoculum density/disease intensity relationships (5), survival of pathogens over time (16), growth, sporulation, and infection of pathogens under different environments (9,10), model testing (8), and disease intensity/crop loss relationships (1). Nonlinear regression is used frequently to fit disease proportions over time to various growth models (2,12), disease prediction from environmental parameters (8), crop loss estimation from disease intensity (13), growth, sporulation, and infection of a pathogen with temperature (3), and the relationship of disease intensity to size of experimental plots (7) or to calcium carbonate concentration (4).

For both linear and nonlinear regression, the coefficient of determination is possibly the statistic used most often to assess the goodness-of-fit of empirical models fitted to data. This is because the value of $R^2$ is provided by every current computer program for regression analysis. Nearly every published article in which regression analysis was performed lists the $R^2$ associated with each equation fitted. The appropriateness of $R^2$ to assess the goodness of a fitted model is under investigation (11) and, until alternative measures are suggested, it is imperative that the meaning of $R^2$ and the factors that influence it be understood.

In the fitting of regression models, researchers occasionally raise one or the other of the following two questions when they discover the value of $R^2$ is extremely low for their model: Why is $R^2$ so low when the equation seems to fit the data very well? What is the appropriate method to calculate $R^2$ to determine the goodness-of-fit of a nonlinear model, e.g., exponential models or power functions?

In this article, to address the first question we identify some of the factors in a data set that lower the value of $R^2$. Our purpose in singling out these factors is twofold: first, to acquaint users of regression techniques with the potential pitfalls that result from relying too heavily on $R^2$ as a model closeness criterion, and second, to point out that corrective actions to obtain a high $R^2$ value often are contrary to the principles of good experimental design.

We shall answer the second question by listing several analogous statistics to $R^2$ that are sometimes provided by current computer programs for regression analysis.

METHODS

Artificial data sets were generated and linear or nonlinear models were fitted by least-squares regression either by hand calculation or by the Statistical Analysis System package (15), using the facilities of the Northeast Regional Data Center of the State University System of Florida in Gainesville.

RESULTS AND DISCUSSION

Factors that affect $R^2$ in the fitting of simple linear regression equations. In the simple linear regression equation, $Y_i = a + bX_i + e_i$, $Y_i$ is the $i$th observation of the dependent variable and $X_i$ is the value of the independent variable at which $Y_i$ is observed. The quantities $a$ and $b$ are unknown parameters that represent the intercept and slope of the regression line, respectively. The random error associated with $Y_i$ is termed $e_i$. The usual assumptions regarding the errors are that, in a population of $N$ values of $Y_i$, the random errors ($e_i$) have zero mean, a common variance ($\sigma^2$), and are independent of one another.

To illustrate the calculations that are required in the analysis of a fitted regression equation, the simple linear regression equation ($Y_i = a + bX_i + e_i$) is fitted to each of two data sets denoted as A and B. The observations ($Y_i$) are the same in data sets A and B but the ranges of $X_i$ are different (Table 1). The plots of the fitted regression equations are shown in Figure 1.

Fig. 1. Linear regression equation fitted to two data sets (A and B) with identical $Y$ values. The different ranges of $X$ cause different estimates of the slopes, intercepts, and $R^2$ values.

Included among the entries in Table 1 are the predicted responses ($\hat{Y}_i$) at each $X_i$ obtained with the fitted regression equation. The quantity $Y_i - \hat{Y}_i$ represents the difference between the observed value ($Y_i$) and the predicted value ($\hat{Y}_i$) at $X_i$, and this difference is called the residual corresponding to the $i$th observation. The larger the values of the residuals, the less confident one feels about how well the estimated equation fits the observed values. A numerical measure, therefore, of how well the model actually fits the data is the variance of the residuals, $s^2 = SSE/(N - 2)$. When the residuals are large, $s^2$ is large. Also, the individual residuals can be plotted against the values of $X$ or the values of $\hat{Y}_i$ to ascertain if the linear model is indeed the appropriate choice. In both plots, if the model is correct, the values of the residuals will exhibit random scatter about the line $Y - \hat{Y} = 0$, and the scatter is approximately uniform for all values of $X$ and/or $\hat{Y}_i$.

The positive square root of $s^2$ (i.e., $s_e$) is called the estimated standard error of the $Y_i$ values about the regression line (6), that is, $s_e = \sqrt{\sum (Y_i - \hat{Y}_i)^2 / (N - 2)}$ is a function of the residuals, $Y_i - \hat{Y}_i$, about the regression equation. Thus $s_e$ represents a measure of the error with which any observed value of $Y$ is selected from the distribution of $Y$ values at each value of $X$.

In Figure 1, the two plots of $X$ and $Y$ values for sets A and B differ only in the spread or range of $X_i$. In set A, the range is 13 - 1 = 12 units, whereas in set B the range is 8 - 1 = 7 units. The different ranges of $X_i$ result in different estimates both for $a$ and $b$ in the two fitted regression equations and also different estimates of the error variance ($s^2$). Because $R^2 = 0.9629$ for the fitted equation with data set A is higher than $R^2 = 0.8575$ with data set B, in spite of the fact that the estimated slope, $b$, is larger with set B than with set A, we are led to believe the equation $\hat{Y}_i = 10.63 + 3.04X_i$ fits the data in set A better than the equation $\hat{Y}_i = 6.00 + 5.58X_i$ fits the data in set B. Before we can determine if indeed this is the case, we need to define $R^2$.

Definition of $R^2$. In the calculation of the summary statistics (Table 1), the quantity $SS_Y$ is a measure of the variation in the $Y_i$ values about their mean, $\bar{Y}$. In other words, $SS_Y$ is a measure of the uncertainty in predicting $Y$ without taking $X$ into consideration. Similarly, $SSE$ is a measure of the variation in the values of $Y_i$, or the uncertainty in predicting $Y$, when a regression model containing the variable $X$ is employed. A reasonable measure of the effect of $X$ in explaining the variation in $Y$ is $R^2$, calculated either as

$$R^2 = 1 - SSE/SS_Y \quad \text{or} \quad R^2 = (SS_Y - SSE)/SS_Y \qquad (1)$$

The quantity $(SS_Y - SSE)$ is equal to $b\,SS_{XY}$ and is the regression sum of squares (S.S. Regression), that is, the variation in the $Y_i$ values explained, or accounted for, by the fitted regression equation $\hat{Y}_i = a + bX_i$.

TABLE 1. Calculations needed to obtain the fitted regression equations and other summary statistics for two data sets

| $Y_i$ | $Y_i - \bar{Y}$ | A: $X_i$ | A: $X_i - \bar{X}$ | A: $\hat{Y}_i$ | A: $Y_i - \hat{Y}_i$ | B: $X_i$ | B: $X_i - \bar{X}$ | B: $\hat{Y}_i$ | B: $Y_i - \hat{Y}_i$ |
|---|---|---|---|---|---|---|---|---|---|
| 15 | -16.125 | 1 | -5.75 | 13.67 | 1.33 | 1 | -3.5 | 11.58 | 3.42 |
| 17 | -14.125 | 2 | -4.75 | 16.70 | 0.30 | 2 | -2.5 | 17.17 | -0.17 |
| 20 | -11.125 | 3 | -3.75 | 19.74 | 0.26 | 3 | -1.5 | 22.75 | -2.75 |
| 18 | -13.125 | 4 | -2.75 | 22.78 | -4.78 | 4 | -0.5 | 28.33 | -10.33 |
| 43 | 11.875 | 9 | 2.25 | 37.96 | 5.04 | 5 | 0.5 | 33.92 | 9.08 |
| 42 | 10.875 | 10 | 3.25 | 40.99 | 1.01 | 6 | 1.5 | 39.50 | 2.50 |
| 45 | 13.875 | 12 | 5.25 | 47.06 | -2.06 | 7 | 2.5 | 45.08 | -0.08 |
| 49 | 17.875 | 13 | 6.25 | 50.10 | -1.10 | 8 | 3.5 | 50.67 | -1.67 |

| Summary statistic | Data set A | Data set B |
|---|---|---|
| $\sum Y_i$ | 249.0 | 249.0 |
| $\bar{Y}$ | 31.125 | 31.125 |
| $\sum X_i$ | 54.0 | 36.0 |
| $\bar{X}$ | 6.75 | 4.5 |
| $\sum (Y_i - \bar{Y})^2 = SS_Y$ | 1,526.875 | 1,526.875 |
| $\sum (X_i - \bar{X})^2 = SS_X$ | 159.50 | 42.0 |
| $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = SS_{XY}$ | 484.25 | 234.5 |
| $\sum (Y_i - \hat{Y}_i)^2 = SSE$ | 56.67 | 217.58 |
| Estimate of the intercept, $a = \bar{Y} - b\bar{X}$ | 10.63 | 6.00 |
| Estimate of the slope, $b = SS_{XY}/SS_X$ | 3.04 | 5.58 |
| S.S. Regression $= SS_Y - SSE = b\,SS_{XY}$ | 1,470.21 | 1,309.29 |
| Coefficient of determination, $R^2 = 1 - SSE/SS_Y$ | 0.9629 | 0.8575 |
| Estimate of $\sigma^2$: $s^2 = SSE/(N - 2)$ | 9.44 | 36.26 |
| Slope/standard error, $b/s_e$ | 0.9894 | 0.9267 |
| Fitted regression equation | $\hat{Y}_i = 10.63 + 3.04X_i$ | $\hat{Y}_i = 6.00 + 5.58X_i$ |
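
The summary statistics in Table 1 can be reproduced with a few lines of code. The following is a minimal sketch in Python, assuming only the $Y$ and $X$ values listed in Table 1. The function name `fit_simple_linear` and the dictionary keys are illustrative choices, not part of the original article, whose fits were obtained by hand calculation or with the Statistical Analysis System package.

```python
from math import sqrt


def fit_simple_linear(x, y):
    """Least-squares fit of Y = a + bX, returning the Table 1 quantities."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Corrected sums of squares and cross products.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = ss_xy / ss_x          # slope estimate
    a = y_bar - b * x_bar     # intercept estimate

    # Residuals Y_i - Yhat_i and their sum of squares (SSE).
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)

    s2 = sse / (n - 2)        # estimate of the error variance
    return {
        "a": a,
        "b": b,
        "ss_y": ss_y,
        "sse": sse,
        "ss_regression": ss_y - sse,
        "r2": 1 - sse / ss_y,             # Equation 1
        "s2": s2,
        "slope_over_se": b / sqrt(s2),
    }


# Observations from Table 1: identical Y values, different X ranges.
Y = [15, 17, 20, 18, 43, 42, 45, 49]
X_A = [1, 2, 3, 4, 9, 10, 12, 13]
X_B = [1, 2, 3, 4, 5, 6, 7, 8]

fit_a = fit_simple_linear(X_A, Y)
fit_b = fit_simple_linear(X_B, Y)

for label, fit in (("A", fit_a), ("B", fit_b)):
    print(
        f"Set {label}: a={fit['a']:.2f}  b={fit['b']:.2f}  "
        f"SSE={fit['sse']:.2f}  R2={fit['r2']:.4f}  s2={fit['s2']:.2f}"
    )
```

Rounded to the precision of Table 1, the intercept, slope, $SSE$, $R^2$, and $s^2$ from this sketch agree with the tabulated entries. The slope/standard error ratio differs slightly in the third decimal place (about 0.988 and 0.927 here), apparently because the tabulated ratio was formed from the rounded values of $b$ and $s^2$.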
[1] T. O. Kvålseth. Cautionary Note about $R^2$, 1985.
[2] H. Dillard, et al. Relationship between sclerotial spatial pattern and density of Sclerotinia minor and the incidence of lettuce drop, 1985.
[3] M. Jeger, et al. Optimizing Plot Size for Field Studies of Phymatotrichum Root Rot of Cotton, 1985.