LASSO-Patternsearch Algorithm with Application to Ophthalmology Data

The LASSO-Patternsearch is proposed as a two-stage procedure to identify clusters of multiple risk factors for outcomes of interest in large demographic studies, when the predictor variables are dichotomous or take on values in a small finite set. Many diseases are suspected of having multiple interacting risk factors acting in concert, and it is of much interest to uncover higher-order interactions when they exist. The method is related to Zhang et al. (2004), except that variable flexibility is sacrificed to allow entertaining models with high- as well as low-order interactions among multiple predictors. In the first stage, a LASSO is used to select important patterns; it is applied conservatively so that true patterns are retained at a high rate, at the cost of admitting some noise. In the second stage, the patterns selected by the LASSO are tested in the framework of (parametric) generalized linear models to reduce the noise. Notably, the patterns are those that arise naturally from the log-linear expansion of the multivariate Bernoulli density. Separate tuning procedures are proposed for the LASSO step and the parametric step, and a novel computational algorithm for the LASSO step is developed to handle the large number of unknowns in the problem. The method is applied to data from the Beaver Dam Eye Study and is shown to expose physiologically interesting interacting risk factors. In a study of progression of myopia in an older cohort, it is found that the risk for smokers is reduced by taking vitamins, while the risk for non-smokers is independent of the “taking vitamins” variable. This agrees with the general result that smoking reduces the absorption of vitamins, and certain vitamins have been
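The first stage described above can be illustrated with a minimal sketch. It assumes that a "pattern" is the product of a subset of 0/1 predictors (equal to 1 only when every factor in the subset is present) and that the LASSO step is an L1-penalized logistic regression on those patterns. The function names (`pattern_basis`, `soft_threshold`, `l1_logistic`), the toy data, and the penalty value are hypothetical, and the solver is a plain proximal-gradient iteration swapped in for illustration, not the novel algorithm the abstract refers to. The second-stage testing in a parametric generalized linear model and the tuning procedures are omitted.

```python
import itertools
import math


def pattern_basis(x, max_order):
    """Map each index tuple (j1, ..., jr), r <= max_order, to the product of those x_j."""
    return {
        idx: int(all(x[j] == 1 for j in idx))
        for r in range(1, max_order + 1)
        for idx in itertools.combinations(range(len(x)), r)
    }


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def l1_logistic(rows, y, lam, iters=3000):
    """Minimize mean logistic loss + lam * sum|c_j| by proximal gradient (ISTA).

    The intercept mu is left unpenalized (an assumption of this sketch).
    """
    n, m = len(rows), len(rows[0])
    # For 0/1 features the gradient's Lipschitz constant is at most (m + 1) / 4,
    # so this step size is a conservative, safe choice.
    step = 4.0 / (m + 1)
    mu, c = 0.0, [0.0] * m
    for _ in range(iters):
        g_mu, g_c = 0.0, [0.0] * m
        for r, yi in zip(rows, y):
            f = mu + sum(cj * rj for cj, rj in zip(c, r))
            resid = (1.0 / (1.0 + math.exp(-f)) - yi) / n
            g_mu += resid
            for j, rj in enumerate(r):
                if rj:
                    g_c[j] += resid
        mu -= step * g_mu
        c = [soft_threshold(cj - step * gj, step * lam) for cj, gj in zip(c, g_c)]
    return mu, c


# Toy data (hypothetical): all 16 settings of 4 binary risk factors, with the
# outcome present exactly when factors 0 and 1 are both present.
X = list(itertools.product([0, 1], repeat=4))
y = [int(x[0] == 1 and x[1] == 1) for x in X]

names = list(pattern_basis(X[0], 2))  # main effects and pairwise patterns
rows = [list(pattern_basis(x, 2).values()) for x in X]

mu, coef = l1_logistic(rows, y, lam=0.05)
top = max(zip(names, coef), key=lambda nc: abs(nc[1]))
print("pattern with largest coefficient:", top)
```

In this toy setting the interaction pattern for factors 0 and 1 receives the dominant coefficient. The number of patterns grows combinatorially with the number of predictors and the maximum order, which is why the abstract stresses the need for an algorithm that can handle a large number of unknowns.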

[1] G. Wahba, et al. A completely automatic french curve: fitting spline functions by cross validation, 1975.

[2] Peter Craven, et al. Smoothing noisy data with spline functions, 1978.

[3] D. Bertsekas. Projected Newton methods for optimization problems with simple constraints, 1981, CDC 1981.

[4] Dimitri P. Bertsekas, et al. Constrained Optimization and Lagrange Multiplier Methods, 1982.

[5] Leo Breiman, et al. Classification and Regression Trees, 1984.

[6] Ker-Chau Li, et al. Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing, 1986.

[7] R. Klein, et al. The Beaver Dam Eye Study: visual acuity, 1991, Ophthalmology.

[8] R. Klein, et al. Are sex hormones associated with age-related maculopathy in women? The Beaver Dam Eye Study, 1994, Transactions of the American Ophthalmological Society.

[9] R. Klein, et al. Alcohol use and age-related maculopathy in the Beaver Dam Eye Study, 1995, American journal of ophthalmology.

[10] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[11] G. Wahba, et al. A GENERALIZED APPROXIMATE CROSS VALIDATION FOR SMOOTHING SPLINES WITH NON-GAUSSIAN DATA, 1996.

[12] R. Klein, et al. The relation of cardiovascular disease and its risk factors to the 5-year incidence of age-related maculopathy: the Beaver Dam Eye Study, 1997, Ophthalmology.

[13] Xiwu Lin. Smoothing Spline Analysis Of Variance For Polychotomous Response Data, 1998.

[14] Dan Steinberg, et al. THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING, 1998.

[15] R. Klein, et al. Alcohol consumption and the 5-year incidence of age-related maculopathy: the Beaver Dam eye study, 1998, Ophthalmology.

[16] Michael A. Saunders, et al. Atomic Decomposition by Basis Pursuit, 1998, SIAM J. Sci. Comput.

[17] Xiwu Lin, et al. Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV, 2000.

[18] Wenjiang J. Fu, et al. Asymptotics for lasso-type estimators, 2000.

[19] M. R. Osborne, et al. On the LASSO and its Dual, 2000.

[20] G. Wahba, et al. Smoothing Spline ANOVA for Multivariate Bernoulli Observations With Application to Ophthalmology Data, 2001.

[21] Grace Wahba, et al. Soft and hard classification by reproducing kernel Hilbert space methods, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[22] R. Klein, et al. Changes in refraction over 10 years in an adult population: the Beaver Dam Eye study, 2002, Investigative ophthalmology & visual science.

[23] Ingo Ruczinski, et al. Logic Regression: Methods and Software, 2003.

[24] R. Tibshirani, et al. Least angle regression, 2004, math/0406456.

[25] Meta M. Voelker, et al. Variable Selection and Model Building via Likelihood Basis Pursuit, 2004.

[26] W. Loh, et al. LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees, 2004.

[27] Steve R. Gunn, et al. Structural Modelling with Sparse Kernels, 2002, Machine Learning.

[28] P. Galan, et al. Serum concentrations of β-carotene, vitamins C and E, zinc and selenium are influenced by sex, age, diet, smoking status, alcohol consumption and corpulence in a general French adult population, 2005, European Journal of Clinical Nutrition.

[29] Hao Helen Zhang, et al. COMPONENT SELECTION AND SMOOTHING FOR NONPARAMETRIC REGRESSION IN EXPONENTIAL FAMILIES, 2006.

[30] G. Wahba, et al. A NOTE ON THE LASSO AND RELATED PROCEDURES IN MODEL SELECTION, 2006.

[31] Gene H. Golub, et al. Generalized cross-validation as a method for choosing a good ridge parameter, 1979, Milestones in Matrix Computation.