Enriching a Large-Scale Survey from a Representative Sample by Data Fusion: Models and Validation

Data fusion is a series of operations that exploits information which has already been collected. Here we present a complete, real-world application of data fusion, focusing on all the operational steps it requires. These steps define the key points of such a procedure: selecting the hinge variables, grafting donors and recipients, choosing the imputation model, and assessing the quality of the imputed data. We present a standard methodology for assessing the suitability of the chosen imputation model. To that end we use a validation suite of seven statistics that measure different facets of the quality of the imputed data: comparing the global marginal statistics, assessing the accuracy of the imputed values, and evaluating the goodness of fit of the imputed data. To measure how well the recipient individuals match the donor set, we compute the significance of the validation statistics by bootstrapping, under the assumption that the recipients are a random sample of the donor population. To illustrate the proposed approach, we perform a real data fusion operation on citizen victimization, in which opinions on perceived safety collected from a representative sample are imputed to enrich a large-scale survey on citizen victimization.
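The bootstrap significance step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the seven validation statistics, so the absolute difference in means of a single numeric variable stands in for one of them, and the function name `bootstrap_p_value`, its parameters, and the toy data are all hypothetical. Under the stated assumption that recipients are a random sample of the donor population, samples of the recipients' size are drawn with replacement from the donors, and the observed statistic is compared with the resulting distribution.

```python
import random
from statistics import mean


def bootstrap_p_value(donor_values, recipient_values, n_boot=2000, seed=0):
    """Bootstrap p-value for |mean(recipients) - mean(donors)|.

    Null hypothesis (taken from the abstract): the recipients are a
    random sample of the donor population. The statistic itself is an
    illustrative stand-in, not one of the paper's seven statistics.
    """
    rng = random.Random(seed)
    donor_mean = mean(donor_values)
    observed = abs(mean(recipient_values) - donor_mean)
    n = len(recipient_values)

    extreme = 0
    for _ in range(n_boot):
        # Draw a pseudo-recipient sample from the donors, with replacement.
        resample = rng.choices(donor_values, k=n)
        if abs(mean(resample) - donor_mean) >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_boot + 1)


# Toy data (hypothetical): a hinge variable coded 0..9.
donors = [i % 10 for i in range(1000)]
similar_recipients = [i % 10 for i in range(50)]  # same distribution as donors
shifted_recipients = [9] * 50                     # clearly different from donors

p_similar = bootstrap_p_value(donors, similar_recipients)
p_shifted = bootstrap_p_value(donors, shifted_recipients)
```

A small p-value suggests that the recipients differ from the donors on that statistic more than random sampling alone would explain. The same resampling loop would apply to any other validation statistic by replacing the two `mean`-based lines.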
