Variable Importance Analysis with the multiPIM R Package

We describe the R package multiPIM, including its statistical background, functionality and user options. The package performs variable importance analysis and is meant primarily for analyzing data from exploratory epidemiological studies, though it can be applied in other areas as well. Its approach to variable importance comes from the causal inference field and differs from the approaches taken in other R packages. By default, multiPIM uses a double robust targeted maximum likelihood estimator (TMLE) of a parameter akin to the attributable risk. Several regression methods and machine learning algorithms are available for estimating the nuisance parameters of the models, including super learner, a meta-learner that combines several different algorithms into one. We describe a simulation in which the double robust TMLE is compared to the graphical computation estimator. We also provide example analyses using two data sets that are included with the package.
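To make "a parameter akin to the attributable risk" concrete, the following is a sketch of how a population intervention parameter of this kind is commonly written. The notation is an assumption introduced here for illustration and does not appear in the abstract: Y is the outcome, A a binary exposure with A = 0 as the unexposed level, W the covariates, Q(A, W) = E(Y | A, W) the outcome regression and g(0 | W) = P(A = 0 | W) the exposure mechanism (the two nuisance parameters), and Q_n a fitted estimate of Q.

```latex
% Target parameter: adjusted mean outcome had everyone been unexposed,
% minus the observed mean outcome (the negative of an attributable risk).
\psi = E_W\!\left[\, E(Y \mid A = 0, W) \,\right] - E(Y)

% Plug-in (G-computation) estimator based on a fitted outcome regression Q_n:
\hat{\psi}_n = \frac{1}{n} \sum_{i=1}^{n} Q_n(0, W_i) \;-\; \frac{1}{n} \sum_{i=1}^{n} Y_i
```

Under this notation the plug-in estimator relies only on the estimate of Q, whereas a double robust estimator such as the TMLE also uses an estimate of g and remains consistent if either of the two nuisance parameters is estimated consistently.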
