Interpretation of regression coefficients under a latent variable regression model

In standard linear regression where the predictor matrix X is of full rank, the regression coefficients are clearly defined as the parameters B appearing in the linear regression model. In latent variable models there is no direct relationship between the predictor variables and the response variables. Rather, both are related to an underlying reduced‐rank set of latent variables. Recent papers have proposed different methods for obtaining approximate covariance matrices for the estimates of the regression coefficients from methods such as partial least squares (PLS), and for using them to construct ‘confidence intervals’, to select variables and to judge variable importance. However, in the latent variable model a matrix of regression coefficients, B, does not even appear as a parameter matrix. When the data follow such a model, it is therefore unclear how the regression coefficients and, by extension, any covariance matrices and ‘confidence intervals’ should be interpreted. In this paper we show that any inference depends critically on how one defines these regression coefficients. Two definitions for the regression coefficients are given that are consistent with the latent variable model. Which of these definitions is more relevant is shown to depend strongly on the goals of the analysis. One must therefore be clear about which definition is being used for these coefficients when building predictive models, when screening variables based on them or when using them to interpret the system. Under standard normality assumptions, different estimation methods such as ordinary least squares (OLS) and PLS are shown to give very different distributions for the regression coefficient estimates when the data follow a latent variable model. This is shown to be not just a matter of the PLS coefficients being biased or the OLS estimates having large variance, but of more complex differences implied by the structure of the model parameters in the latent variable model. How the distributions of these estimates relate to the definitions given in this paper is explored. It is shown for a simple case that the relative sizes of the PLS estimates, on average, tend to reflect the latent variable loadings, whereas the relative sizes of the OLS estimates, on average, are a function not only of the loadings but also of the error variances of the predictor variables. Thus in this particular case it appears that the relative sizes of the coefficient estimates from PLS reflect the underlying latent structure, whereas those from OLS also reflect the error structure of the predictor variables.

Copyright © 2001 John Wiley & Sons, Ltd.
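
The contrast between PLS and OLS estimates described above can be illustrated with a small simulation. The sketch below is not the paper's own example: it assumes a single latent variable t with unit variance, predictors x_j = p_j t + e_j with independent errors of variance psi_j, and a response y = q t + f. The function names, loadings, error variances and sample sizes are all hypothetical choices. Under these assumptions the population OLS coefficients are proportional to p_j / psi_j, while the one-component PLS coefficients are proportional to p_j, so giving every predictor the same loading but a different error variance separates the two behaviours.

```python
import numpy as np

def simulate(n, loadings, err_var, q=1.0, resp_var=0.1, rng=None):
    """Draw (X, y) from a one-latent-variable model (an illustrative assumption)."""
    rng = np.random.default_rng(rng)
    t = rng.standard_normal(n)                        # latent scores, variance 1
    E = rng.standard_normal((n, len(loadings))) * np.sqrt(err_var)
    X = np.outer(t, loadings) + E                     # x_j = p_j * t + e_j
    y = q * t + np.sqrt(resp_var) * rng.standard_normal(n)
    return X, y

def ols_coef(X, y):
    """Ordinary least squares coefficients on centred data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.lstsq(Xc, yc, rcond=None)[0]

def pls1_coef(X, y):
    """One-component PLS regression coefficients on centred data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)                         # unit-length weight vector
    t = Xc @ w                                        # scores of the single component
    return w * (t @ yc) / (t @ t)                     # regress y on t, map back to X

# Hypothetical settings: equal loadings, unequal predictor error variances.
rng = np.random.default_rng(0)
loadings = np.array([1.0, 1.0, 1.0])
err_var = np.array([0.1, 0.4, 1.6])
reps, n = 200, 500

ols = np.empty((reps, loadings.size))
pls = np.empty((reps, loadings.size))
for r in range(reps):
    X, y = simulate(n, loadings, err_var, rng=rng)
    ols[r] = ols_coef(X, y)
    pls[r] = pls1_coef(X, y)

mean_ols = ols.mean(axis=0)
mean_pls = pls.mean(axis=0)
print("mean OLS coefficients:", np.round(mean_ols, 3))
print("mean PLS coefficients:", np.round(mean_pls, 3))
```

In this setup the averaged PLS coefficients should come out nearly equal across the three predictors, mirroring the equal loadings, while the averaged OLS coefficients should decrease as the predictor error variance increases.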