Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values.

Estimating the mean and the covariance matrix of an incomplete dataset and filling in missing values with imputed values is generally a nonlinear problem, which must be solved iteratively. The expectation maximization (EM) algorithm for Gaussian data, an iterative method both for the estimation of mean values and covariance matrices from incomplete datasets and for the imputation of missing values, is taken as the point of departure for the development of a regularized EM algorithm. In contrast to the conventional EM algorithm, the regularized EM algorithm is applicable to sets of climate data, in which the number of variables typically exceeds the sample size. The regularized EM algorithm is based on iterated analyses of linear regressions of variables with missing values on variables with available values, with regression coefficients estimated by ridge regression, a regularized regression method in which a continuous regularization parameter controls the filtering of the noise in the data. The regularization parameter is determined by generalized cross-validation so as to approximately minimize the expected mean-squared error of the imputed values. The regularized EM algorithm can estimate, and exploit for the imputation of missing values, both synchronic and diachronic covariance matrices, which may contain information on spatial covariability, stationary temporal covariability, or cyclostationary temporal covariability. A test of the regularized EM algorithm with simulated surface temperature data demonstrates that the algorithm is applicable to typical sets of climate data and that it leads to more accurate estimates of the missing values than a conventional noniterative imputation technique.
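
The iteration described above can be illustrated with a minimal sketch in Python. It is not the paper's implementation: the function name regem_sketch and its parameters are hypothetical, the regularization parameter h is held fixed rather than chosen by generalized cross-validation, and the residual covariance term is only an approximation to the conditional covariance of the imputation error. Each pass regresses the missing variables of a record on its available variables, using a ridge-inflated covariance block, and then re-estimates the mean and the covariance matrix from the completed data.

```python
import numpy as np

def regem_sketch(X, h=0.1, n_iter=50, tol=1e-6):
    """Impute NaNs in X (n records x p variables) by iterated ridge regressions.

    Assumptions (not from the abstract): fixed relative ridge parameter h,
    every column has at least one available value.
    """
    X = np.array(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)

    # Initial guess: fill each gap with the mean of the available values.
    col_mean = np.nanmean(X, axis=0)
    X[missing] = col_mean[np.nonzero(missing)[1]]
    C = np.cov(X, rowvar=False)

    for _ in range(n_iter):
        mu = X.mean(axis=0)
        X_old = X.copy()
        C_res = np.zeros((p, p))  # accumulated imputation-error covariance

        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            a = ~m
            if not a.any():
                # Nothing available to regress on: fall back to the mean.
                X[i, m] = mu[m]
                C_res[np.ix_(m, m)] += C[np.ix_(m, m)]
                continue
            Caa = C[np.ix_(a, a)]
            Cam = C[np.ix_(a, m)]
            Cmm = C[np.ix_(m, m)]
            # Ridge regression: inflate the diagonal of the available block.
            d = np.maximum(np.diag(Caa), 1e-12)
            B = np.linalg.solve(Caa + h**2 * np.diag(d), Cam)
            X[i, m] = mu[m] + (X[i, a] - mu[a]) @ B
            # Approximate residual covariance of the imputed values.
            C_res[np.ix_(m, m)] += Cmm - Cam.T @ B

        # Re-estimate the covariance from completed data plus residual term.
        Xc = X - X.mean(axis=0)
        C = (Xc.T @ Xc + C_res) / (n - 1)

        # Stop when the imputed values no longer change appreciably.
        step = np.linalg.norm(X[missing] - X_old[missing])
        scale = max(np.linalg.norm(X[missing]), np.finfo(float).eps)
        if step <= tol * scale:
            break

    return X, X.mean(axis=0), C
```

A call such as filled, mean, cov = regem_sketch(data) returns the completed data matrix together with the estimated mean vector and covariance matrix. Because the ridge term keeps the regression well-posed, the sketch also runs when the number of variables exceeds the number of records, the situation the abstract identifies as typical of climate data.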

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  H. E. Kallmann,et al.  Transient Response , 1945, Proceedings of the IRE.

[3]  S. F. Buck A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer , 1960 .

[4]  I. Miller Probability, Random Variables, and Stochastic Processes , 1966 .

[5]  A. E. Hoerl,et al.  Ridge Regression: Applications to Nonorthogonal Problems , 1970 .

[6]  J. T. Webster,et al.  Latent Root Regression Analysis , 1974 .

[7]  E. Beale,et al.  Missing Values in Multivariate Analysis , 1975 .

[8]  D. Rubin INFERENCE AND MISSING DATA , 1975 .

[9]  G. Wahba Practical Approximate Solutions to Linear Operator Equations When the Data are Noisy , 1977 .

[10]  P. M. Prenter,et al.  A formal comparison of methods proposed for the numerical solution of first kind integral equations , 1981, The Journal of the Australian Mathematical Society. Series B. Applied Mathematics.

[11]  P. Linz Uncertainty in the solution of linear operator equations , 1984 .

[12]  Charles R. Johnson,et al.  Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[13]  John G. Proakis,et al.  Probability, random variables and stochastic processes , 1985, IEEE Trans. Acoust. Speech Signal Process..

[14]  A. Tarantola Inverse problem theory : methods for data fitting and model parameter estimation , 1987 .

[15]  P. Hansen Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion , 1987 .

[16]  G. Wahba Spline models for observational data , 1990 .

[17]  Sabine Van Huffel,et al.  Total least squares problem - computational aspects and analysis , 1991, Frontiers in applied mathematics.

[18]  Syukuro Manabe,et al.  Transient responses of a coupled ocean-atmosphere model to gradual changes of atmospheric CO2 , 1991 .

[19]  Sabine Van Huffel,et al.  The total least squares problem , 1993 .

[20]  Syukuro Manabe,et al.  Multiple-Century Response of a Coupled Ocean-Atmosphere Model to an Increase of Atmospheric Carbon Dioxide , 1994 .

[21]  P. Jones,et al.  Hemispheric Surface Air Temperature Variations: A Reanalysis and an Update to 1993. , 1994 .

[22]  David Parker,et al.  Interdecadal changes of surface temperature since the late nineteenth century , 1994 .

[23]  Behavior near zero of the distribution of GCV smoothing parameter estimates , 1995 .

[24]  David Parker,et al.  Marine Surface Temperature: Observed Variations and Data Requirements , 1995 .

[25]  Thomas M. Smith,et al.  Reconstruction of Historical Sea Surface Temperatures Using Empirical Orthogonal Functions , 1996 .

[26]  M. Benno Blumenthal,et al.  Reduced space optimal analysis for historical data sets: 136 years of Atlantic sea surface temperatures , 1997 .

[27]  Gene H. Golub,et al.  Regularization by Truncated Total Least Squares , 1997, SIAM J. Sci. Comput..

[28]  Malcolm K. Hughes,et al.  Global-scale temperature patterns and climate forcing over the past six centuries , 1998, Nature.

[29]  Balaji Rajagopalan,et al.  Analyses of global sea surface temperature 1856–1991 , 1998 .

[30]  Gene H. Golub,et al.  Tikhonov Regularization and Total Least Squares , 1999, SIAM J. Matrix Anal. Appl..

[31]  S. Griffies,et al.  A Conceptual Framework for Predictability Studies , 1999 .

[32]  John R. Lanzante,et al.  Global mean surface air temperature and North Atlantic overturning in a suite of coupled GCM climate change experiments , 1999 .

[33]  M. Hughes,et al.  Northern hemisphere temperatures during the past millennium: Inferences, uncertainties, and limitations , 1999 .

[34]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[35]  B. E. Eckbo,et al.  Appendix , 1826, Epilepsy Research.

[36]  Nicole A. Lazar,et al.  Statistical Analysis With Missing Data , 2003, Technometrics.