Variable importance in matched case–control studies in settings of high dimensional data

We propose a method for assessing variable importance in matched case–control investigations and other highly stratified studies characterized by high dimensional data (p ≫ n). In simulated and real data sets, we show that the proposed algorithm performs better than a conventional univariate method (conditional logistic regression) and a popular multivariable algorithm (random forests) that does not take the matching into account. The methods are applicable to wide-ranging, high-impact clinical studies, including metabolomic and proteomic studies and neuroimaging analyses, such as those assessing stroke and Alzheimer's disease. The proposed methods have been implemented in a freely available R library (http://cran.r-project.org/web/packages/RPCLR/index.html).
