The Choice of Variables in Observational Studies

SUMMARY  A review is given of considerations affecting the choice of explanatory variables in observational studies. Aspects of both design and analysis are considered. In particular, the choice of explanatory variables in multiple regression is discussed and some recommendations are made.

This paper reviews some general aspects of the choice of variables in observational studies. To keep the paper concise, only outline examples have been included and, to be specific, these are medical, although the ideas apply widely. Observational studies, where they are not purely descriptive, have as their objective the explanation or prediction of some response in terms of explanatory or predictor variables. It is useful to have two examples in mind.

Example 1. Consider an investigation into the incidence of a respiratory disease among a certain group of workers. The response variable may be severity of the disease, with possible explanatory variables being the worker's age, physical status, working conditions, previous employment, etc. Some variables may be more important than others in explaining the severity of the disease.

Example 2. A different situation is one of trying to predict the time to death among patients known to be suffering from a progressive and fatal disease. Possible predictive variables are type of treatment, treatment variables such as dose, clinical and biochemical measurements made at diagnosis, etc.

Although careful discussion of the most appropriate way to measure the response is always important, and often several different measures will be called for, what response variables to consider is nevertheless frequently fairly clear-cut. Thus in Example 1, severity may be assessed radiologically and graded according to standard levels. In Example 2, time to death is likely to be measured from the time of diagnosis. In this paper we concentrate on the explanatory variables: how many such variables should be measured and, if many are observed, how should the analysis be handled to find the
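As a concrete illustration of the variable-selection problem described above, the sketch below fits all possible regressions of a response on subsets of candidate explanatory variables and keeps the subset with the best adjusted-R² value, in the spirit of an all-possible-regressions search. It is a minimal illustration only: the simulated data, the number of candidate variables and the choice of criterion are assumptions introduced here, not part of the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data in the spirit of Example 2: predict a response
# (say, log survival time) from several candidate explanatory variables.
n, p = 60, 5
X = rng.normal(size=(n, p))                   # candidate explanatory variables
beta = np.array([1.5, 0.0, -0.8, 0.0, 0.0])   # only variables 0 and 2 truly matter
y = X @ beta + rng.normal(scale=1.0, size=n)

def residual_ss(X_sub, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

# All-possible-regressions search: score every non-empty subset of the
# candidate variables, penalising larger subsets via adjusted R-squared.
total_ss = float(((y - y.mean()) ** 2).sum())
best = None
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        rss = residual_ss(X[:, subset], y)
        adj_r2 = 1.0 - (rss / (n - k - 1)) / (total_ss / (n - 1))
        if best is None or adj_r2 > best[0]:
            best = (adj_r2, subset)

print("selected variables:", best[1], "adjusted R^2: %.3f" % best[0])
```

An exhaustive search of this kind is feasible only when the number of candidate explanatory variables is modest; stepwise procedures trade completeness of the search for computational economy.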
