Exploring copulas for the imputation of complex dependent data

In this work we introduce a copula-based method for imputing missing data that uses the conditional density functions of the missing variables given the observed ones. In theory, such functions can be derived from the multivariate distribution of the variables of interest. In practice, modelling a joint distribution and deriving its conditional distributions is very difficult, especially when the margins differ. We propose a natural solution that exploits copulas: the conditional density functions are derived through the corresponding conditional copulas. The approach is appealing because copula functions enable us (1) to fit any combination of marginal distribution functions, (2) to take into account complex multivariate dependence relationships, and (3) to model the marginal distributions and the dependence structure separately. We describe the method and perform a Monte Carlo study to compare it with two well-known imputation techniques: nearest neighbour donor imputation and regression imputation by the EM algorithm. Our results indicate that the proposal compares favourably with these classical methods in terms of preservation of microdata, margins and dependence structure.
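
As a rough illustration of the idea, the sketch below imputes one variable Y from an observed variable X through a conditional copula. It is not the authors' implementation. It assumes a bivariate Gaussian copula, an exponential margin for X and a normal margin for Y, and all function names (clip_unit, exp_cdf, estimate_rho, impute_y_given_x) are hypothetical. In the bivariate case the conditional density factorises as f(y | x) = c(F_X(x), F_Y(y)) f_Y(y), so the margins and the dependence parameter can be fitted separately and then combined at imputation time.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used to compute normal scores


def clip_unit(u, eps=1e-12):
    """Keep a probability strictly inside (0, 1) so inv_cdf is defined."""
    return min(max(u, eps), 1.0 - eps)


def exp_cdf(x, rate):
    """CDF of the exponential margin assumed for X."""
    return 1.0 - math.exp(-rate * x)


def estimate_rho(pairs, rate_x, y_dist):
    """Pearson correlation of the normal scores of the complete cases."""
    zx = [STD_NORMAL.inv_cdf(clip_unit(exp_cdf(x, rate_x))) for x, _ in pairs]
    zy = [STD_NORMAL.inv_cdf(clip_unit(y_dist.cdf(y))) for _, y in pairs]
    mx, my = fmean(zx), fmean(zy)
    sxy = sum((a - mx) * (b - my) for a, b in zip(zx, zy))
    sxx = sum((a - mx) ** 2 for a in zx)
    syy = sum((b - my) ** 2 for b in zy)
    return sxy / math.sqrt(sxx * syy)


def impute_y_given_x(x, rho, rate_x, y_dist, rng):
    """Draw y from the conditional distribution implied by a Gaussian copula."""
    z_x = STD_NORMAL.inv_cdf(clip_unit(exp_cdf(x, rate_x)))
    # Under a Gaussian copula, z_y | z_x ~ N(rho * z_x, 1 - rho^2).
    z_y = rng.gauss(rho * z_x, math.sqrt(1.0 - rho ** 2))
    # Map the normal score back to the scale of Y through its margin.
    return y_dist.inv_cdf(clip_unit(STD_NORMAL.cdf(z_y)))


# Toy data (assumed for illustration): X ~ Exp(1), Y ~ N(10, 2^2),
# linked by a Gaussian copula with correlation 0.7.
rng = random.Random(0)
TRUE_RHO = 0.7
data = []
for _ in range(2000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = TRUE_RHO * z1 + math.sqrt(1.0 - TRUE_RHO ** 2) * rng.gauss(0.0, 1.0)
    x = -math.log(1.0 - clip_unit(STD_NORMAL.cdf(z1)))  # exponential margin
    y = 10.0 + 2.0 * z2                                 # normal margin
    data.append((x, y))

# Delete Y completely at random for roughly 20% of the records.
observed = [(x, y if rng.random() > 0.2 else None) for x, y in data]
complete = [(x, y) for x, y in observed if y is not None]

# Step 1: fit each margin on its own.
rate_hat = 1.0 / fmean([x for x, _ in observed])
y_hat = NormalDist.from_samples([y for _, y in complete])
# Step 2: fit the dependence parameter on the transformed complete cases.
rho_hat = estimate_rho(complete, rate_hat, y_hat)
# Step 3: fill each gap with a draw from the conditional distribution.
imputed = [
    (x, y if y is not None else impute_y_given_x(x, rho_hat, rate_hat, y_hat, rng))
    for x, y in observed
]
```

Drawing from the conditional distribution, rather than plugging in its mean, is one way to keep the imputed values from shrinking the variance of Y. A different copula family or different margins would change only the conditional step and the margin functions.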
