A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets

Abstract: Datasets with missing data ratios ranging from 24% to 4%, corresponding to three air quality monitoring studies, were used to ascertain whether major differences arise when five commonly used imputation methods (four single imputation methods and one multiple imputation method) are applied. Unrotated and Varimax-rotated factor analyses performed on the imputed datasets were compared. All methods performed similarly, although multiple imputation yielded more dispersed imputed values. The main differences occurred when a variable with missing values correlated poorly with the other features and when a variable had relevant loadings on several unrotated factors, which sometimes changed the order of the rotated factors.
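The observation that multiple imputation yields more dispersed imputed values can be illustrated with a minimal sketch. The abstract does not name the five methods, so everything here is an assumption made for illustration: the data are synthetic, the two single imputation methods are mean substitution and deterministic regression, and the multiple imputation step is a simplified stochastic regression that adds residual noise without redrawing the regression parameters, as a full multiple imputation procedure would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two correlated pollutant series (not the paper's data).
n = 500
x = rng.normal(size=n)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=n)

# Remove roughly 20% of y completely at random.
missing = rng.random(n) < 0.2
obs = ~missing
n_missing = int(missing.sum())

# Single imputation, variant 1: replace every gap with the observed mean.
mean_imputed = np.full(n_missing, y[obs].mean())

# Single imputation, variant 2: deterministic regression of y on x.
slope, intercept = np.polyfit(x[obs], y[obs], 1)
reg_imputed = intercept + slope * x[missing]

# Simplified multiple imputation: m stochastic regression draws, each adding
# noise with the residual standard deviation of the fitted line.
resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)
m = 5
mi_draws = reg_imputed + rng.normal(scale=resid_sd, size=(m, n_missing))

# Dispersion of the imputed values under each approach.
spread_mean = mean_imputed.std()
spread_reg = reg_imputed.std()
spread_mi = mi_draws.std(axis=1).mean()

print(f"mean substitution:      {spread_mean:.3f}")
print(f"regression (single):    {spread_reg:.3f}")
print(f"stochastic draws (m={m}): {spread_mi:.3f}")
```

In this sketch the stochastic draws spread more widely than either single imputation because each draw restores the residual variability that a single fitted value suppresses. A factor analysis run on each completed dataset could then be compared across methods, as the study does.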
