Variable selection and prediction using a nested, matched case‐control study: Application to hospital acquired pneumonia in stroke patients

Matched case-control designs are commonly used in epidemiologic studies to increase efficiency. These designs have recently been introduced to modern imaging and genomic studies, which are characterized by high-dimensional covariates. However, statistical analyses that appropriately adjust for the matching have not been widely adopted. A matched case-control study of 430 acute ischemic stroke patients was conducted at Massachusetts General Hospital (MGH) to identify specific brain regions of acute infarction that are associated with hospital-acquired pneumonia (HAP) in these patients. Infarction was measured in 138 brain regions, which give rise to nearly 10,000 two-way interactions and challenge the statistical analysis. We investigate penalized conditional and unconditional logistic regression approaches to this variable selection problem that properly differentiate between selection of main effects and selection of interactions, and that acknowledge the matching. This neuroimaging study was nested within a larger prospective study of HAP in 1915 stroke patients at MGH, which recorded clinical variables but did not include neuroimaging. We demonstrate how the larger study, in conjunction with the nested, matched study, allows us to derive a score for predicting HAP in future stroke patients based on imaging and clinical features. We evaluate the proposed methods in simulation studies and apply them to the MGH HAP study.
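
As a rough illustration of the scale of the problem and of what a penalized conditional likelihood looks like, the sketch below counts the two-way interactions among 138 regions and evaluates an L1-penalized conditional logistic objective. It assumes 1:1 matching and a plain lasso penalty, neither of which is stated in the abstract; the function names are hypothetical and this is not the authors' implementation, which treats main effects and interactions differently.

```python
import math


def n_pairwise_interactions(p):
    """Number of distinct two-way interactions among p main effects."""
    return math.comb(p, 2)


def penalized_conditional_nll(beta, pairs, lam):
    """L1-penalized negative conditional log-likelihood (hypothetical helper).

    Assumes 1:1 matching: `pairs` is a list of (x_case, x_control) tuples,
    each a sequence of covariates of the same length as `beta`.
    Under 1:1 matching, each matched pair contributes
    log(1 + exp(-beta . (x_case - x_control))), so no intercept or
    matching-stratum parameters need to be estimated.
    """
    nll = 0.0
    for x_case, x_control in pairs:
        # Linear predictor on the within-pair covariate difference.
        eta = sum(b * (a - c) for b, a, c in zip(beta, x_case, x_control))
        # Numerically stable evaluation of log(1 + exp(-eta)).
        nll += max(-eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    # Plain lasso penalty on all coefficients (an assumption for this sketch).
    return nll + lam * sum(abs(b) for b in beta)


# 138 regions give 9453 two-way interactions ("nearly 10,000").
print(n_pairwise_interactions(138))
```

Minimizing this objective over `beta` (for example by coordinate descent) would give a sparse, matching-aware fit; with all coefficients at zero each pair contributes log 2, which is a convenient sanity check.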
