Multivariate Nonparametric Regression

As in many areas of biostatistics, oncological problems often involve multivariate predictors. Assuming a linear additive model is convenient and straightforward, but it is often unsatisfactory when the relation between the outcome measure and the predictors is nonlinear or nonadditive. In addition, when the number of predictors is (much) larger than the number of independent observations, as is the case for many new genomic technologies, standard linear models cannot be fit. In this chapter, we provide a brief overview of some multivariate nonparametric methods, such as regression trees and splines, and we describe how those methods are related to traditional linear models.

Several topics covered in other chapters are closely tied to the methods discussed here. Variable selection (Chapter 2) is a critical ingredient of these nonparametric regression methods; being able to compute accurate prediction errors (Chapter 4) is equally important; and when the number of predictors increases substantially, approaches such as bagging and boosting (Chapter 5) are often essential. There are close connections between the methods discussed in Chapter 5 and some of the methods discussed in Section 3.8.2. We briefly revisit those topics in this chapter, but we refer to the respective chapters for more details. Support vector machines (Chapter 6), which are not discussed in this chapter, offer another approach to nonparametric regression.

We start the chapter by introducing an example that we use throughout. In Section 3.2 we discuss linear and additive models. In Section 3.3 we generalize these models by allowing for interaction effects. In Section 3.4 we discuss basis function expansions, a form in which many nonparametric regression methods, such as regression trees (Section 3.5), splines (Section 3.6), and logic regression (Section 3.7), can be written. In Section 3.8 we discuss the situation in which the predictor space is high dimensional. We conclude the chapter by discussing some issues pertinent to survival data (Section 3.9) and with a brief general discussion (Section 3.10).
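
The remark above that standard linear models cannot be fit when predictors outnumber observations can be illustrated with a small simulation. The sketch below is not taken from the chapter: the sample sizes, the simulated data, and all variable names are assumptions chosen for illustration. It shows that with more predictors than observations the design matrix is rank deficient, so two very different coefficient vectors reproduce the outcomes equally well.

```python
import numpy as np

# Hypothetical sizes: fewer observations than predictors (assumption).
n_obs, n_pred = 20, 50

rng = np.random.default_rng(0)
X = rng.normal(size=(n_obs, n_pred))

# Simulated outcome driven by only three predictors (assumption).
beta_true = np.zeros(n_pred)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=0.1, size=n_obs)

# The rank of X (and hence of X'X) is at most n_obs, so the
# n_pred x n_pred matrix X'X cannot be inverted.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm least-squares solution.
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last rows of Vt from a full SVD span the null space of X.
_, _, vt = np.linalg.svd(X)
null_vec = vt[-1]

# Adding a null-space direction changes the coefficients
# but leaves the fitted values unchanged.
beta_b = beta_a + 5.0 * null_vec

print("rank of X:", rank, "with", n_pred, "predictors")
print("max |residual|, solution a:", np.abs(X @ beta_a - y).max())
print("max |residual|, solution b:", np.abs(X @ beta_b - y).max())
print("distance between solutions:", np.linalg.norm(beta_a - beta_b))
```

Because the least-squares coefficients are not unique in this setting, some form of constraint is needed, which is one motivation for the variable selection and regularization approaches the chapter refers to.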
