Imputation of missing data via penalization techniques

The aim of this master's thesis is to give the user an estimate of the uncertainty associated with imputing missing data. The full factorization approach is compared with the state-of-the-art full conditional approach. The distinguishing feature of both algorithms is their use of penalization techniques. Both algorithms are applied to data with different missingness mechanisms: missing at random (MAR), missing completely at random (MCAR) and not missing at random (NMAR). Simulated datasets were generated with copulas. The simulations varied the rate of missing observations, the number of refits and whether or not lasso regression was used for the last fit. Results are reported in terms of the accuracy of predicted values, pooled variance estimates and errors that occurred during programming and at runtime. The full factorization approach showed advantages over full conditional, especially at a ratio of 10:1 observations to covariates. In cases where covariates outnumbered observations, full conditional and full factorization produced nearly the same results when ridge regression was used for all fits. In general, lasso regression did not improve the accuracy of the imputation results. This finding holds for all missingness types used and all simulations conducted in this thesis. Imputed observations showed paths similar to those of MCMC chains in Bayesian statistics: the imputation steps alternated and converged to certain values. Results were stable across different datasets and random number seeds. Run time for both approaches was high, owing to errors that occurred in various R packages and to functions that are costly in terms of CPU usage.
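The pooled variance estimates mentioned above can be illustrated with a minimal sketch. It assumes pooling follows Rubin's rules for multiple imputation, which the abstract does not confirm; the function name `pool_rubin` and the example numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    Assumption: one estimate and one within-imputation variance per
    imputed dataset. Hypothetical helper, not taken from the thesis.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations of equal length")

    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Illustrative values only: three imputed datasets.
pooled_estimate, pooled_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The total variance combines the average uncertainty within each imputed dataset with the spread of the estimates between datasets, so a larger disagreement between imputations yields a larger pooled variance.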
