Quantifying epidemiologic risk factors using non‐parametric regression: model selection remains the greatest challenge

Logistic regression is widely used to estimate relative risks (odds ratios) from case–control studies, but when the study exposure is continuous, standard parametric models may not accurately characterize the exposure–response curve. Semi‐parametric generalized linear models provide a useful extension. In these models, the exposure of interest is modelled flexibly using a regression spline or a smoothing spline, while other variables are modelled using conventional methods. When coupled with a model‐selection procedure based on minimizing a cross‐validation score, this approach provides a non‐parametric, objective, and reproducible method to characterize the exposure–response curve by one or several models with a favourable bias–variance trade‐off. We applied this approach to case–control data to estimate the dose–response relationship between alcohol consumption and risk of oral cancer among African Americans. We did not find a uniquely ‘best’ model, but results using linear, cubic, and smoothing splines were consistent: there does not appear to be a risk‐free threshold for alcohol consumption vis‐à‐vis the development of oral cancer. This finding was not apparent using a standard step‐function model. In our analysis, the cross‐validation curve had a global minimum and also a local minimum. In general, the phenomenon of multiple local minima makes it more difficult to interpret the results, and may present a computational roadblock to non‐parametric generalized additive models of multiple continuous exposures. Nonetheless, the semi‐parametric approach appears to be a practical advance. Published in 2003 by John Wiley & Sons, Ltd.
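As a rough illustration of the model-selection idea described above, the sketch below fits a logistic regression with a single-knot linear spline for the exposure and computes a K-fold cross-validation deviance for each candidate knot. It is not the authors' implementation. The simulated data, the 5-fold scheme, the single-knot basis, and all function names (`basis`, `fit_logistic`, `cv_deviance`, `local_minima`) are assumptions made for this example only.

```python
import math
import random

def basis(x, knot):
    """Linear-spline design row: intercept, x, and the hinge term (x - knot)+."""
    return [1.0, x, max(0.0, x - knot)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def sigmoid(t):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def fit_logistic(X, y, iters=15, ridge=1e-6):
    """Newton-Raphson for logistic regression; the tiny ridge only aids stability."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [-ridge * b for b in beta]
        hess = [[ridge if i == j else 0.0 for j in range(p)] for i in range(p)]
        for row, yi in zip(X, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, row)))
            w = mu * (1.0 - mu)
            for i in range(p):
                grad[i] += (yi - mu) * row[i]
                for j in range(p):
                    hess[i][j] += w * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def cv_deviance(x, y, knot, k=5):
    """Mean held-out binomial deviance over k folds for one candidate knot."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = fit_logistic([basis(x[i], knot) for i in train], [y[i] for i in train])
        for i in test:
            mu = sigmoid(sum(b * v for b, v in zip(beta, basis(x[i], knot))))
            mu = min(max(mu, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total += -2.0 * (y[i] * math.log(mu) + (1 - y[i]) * math.log(1.0 - mu))
    return total / n

def local_minima(scores):
    """Indices of strict interior local minima of the cross-validation curve."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]

# Simulated case-control-like data (an assumption, not the study data):
# the log-odds rise with exposure up to x = 4 and are flat afterwards.
random.seed(0)
x = [random.uniform(0.0, 10.0) for _ in range(300)]
y = [1 if random.random() < sigmoid(-2.0 + 0.5 * min(xi, 4.0)) else 0 for xi in x]

knots = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
scores = [cv_deviance(x, y, kn) for kn in knots]
best_knot = knots[scores.index(min(scores))]
print("CV deviance by knot:", [round(s, 4) for s in scores])
print("selected knot:", best_knot, "interior local minima at:", local_minima(scores))
```

Printing the whole cross-validation curve, rather than only its minimizer, makes it easy to see whether several knots give nearly equal scores or whether the curve has more than one local minimum, which is the difficulty the abstract highlights.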
