Spatial model fitting for large datasets with applications to climate and microarray problems

Many problems in the environmental and biological sciences involve the analysis of large quantities of data. The data in these problems often exhibit various types of structure, in particular spatial dependence. Traditional model fitting often fails because of the size of the datasets: it is difficult not only to specify but also to compute with the full covariance matrix describing the spatial dependence. We propose a very general type of mixed model that has a random spatial component. Recognizing that spatial covariance matrices often contain a large number of zero or near-zero entries, we use covariance tapering to force near-zero entries to zero. Then, taking advantage of the sparsity of the tapered covariance matrices, we use backfitting to estimate the fixed and random model parameters. The novelty of the paper is the combination of the two techniques, tapering and backfitting, to model and analyze spatial datasets several orders of magnitude larger than those typically analyzed with conventional approaches. Results are demonstrated with two datasets. The first consists of regional climate model output from an experiment with two regional and two driver models arranged in a two-by-two layout. The second is microarray data used to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures.
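As a rough illustration of how tapering and backfitting can fit together, the following minimal sketch may help. It is not the paper's implementation, and every specific choice in it is an assumption made for the example: one-dimensional locations, an exponential covariance with a known range, a Wendland-type taper, a known noise variance `sigma2`, an intercept as the only fixed effect, and a conjugate-gradient solver for the sparse systems. All function names are hypothetical.

```python
import math

def wendland_taper(d, theta):
    """Wendland-type taper (1 - d/theta)^4 (4 d/theta + 1); exactly zero for d >= theta."""
    if d >= theta:
        return 0.0
    r = d / theta
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def tapered_cov(locs, range_, theta):
    """Exponential covariance times the taper, stored as sparse rows {column: value}."""
    rows = []
    for si in locs:
        row = {}
        for j, sj in enumerate(locs):
            d = abs(si - sj)
            t = wendland_taper(d, theta)
            if t > 0.0:  # entries beyond the taper range are never stored
                row[j] = math.exp(-d / range_) * t
        rows.append(row)
    return rows

def matvec(rows, v):
    """Sparse matrix-vector product; cost is proportional to the number of nonzeros."""
    return [sum(val * v[j] for j, val in row.items()) for row in rows]

def cg_solve(rows, shift, b, tol=1e-10, max_iter=500):
    """Conjugate gradients for (C + shift * I) x = b, with C given as sparse rows."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = [a + shift * pi for a, pi in zip(matvec(rows, p), p)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def backfit(y, rows, sigma2, n_iter=100):
    """Backfitting for y = mu + h + noise: alternate between intercept mu and spatial effect h."""
    h = [0.0] * len(y)
    mu = 0.0
    for _ in range(n_iter):
        # Fixed part: least-squares intercept given the current spatial effect.
        mu = sum(yi - hi for yi, hi in zip(y, h)) / len(y)
        # Random part: h = C (C + sigma2 I)^{-1} (y - mu), using only sparse operations.
        resid = [yi - mu for yi in y]
        h = matvec(rows, cg_solve(rows, sigma2, resid))
    return mu, h

# Toy data on a regular 1-D grid (illustrative only).
locs = [float(i) for i in range(40)]
y = [2.0 + math.sin(s / 3.0) for s in locs]
C = tapered_cov(locs, range_=3.0, theta=5.0)
mu_hat, h_hat = backfit(y, C, sigma2=0.5)
nnz = sum(len(row) for row in C)
print(f"stored entries: {nnz} of {len(locs) ** 2}; intercept estimate: {mu_hat:.3f}")
```

In this sketch the taper range `theta` controls the trade-off: a smaller value stores fewer entries and makes each solve cheaper, at the cost of a cruder approximation to the untapered covariance. For a fixed covariance, the backfitting iterations should approach the generalized least squares intercept under the tapered model.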
