Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes and change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the ‘folklore’ threshold of |r| > 0.7 for correlation coefficients between predictor variables was an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
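
As a minimal illustration of the two ideas above (the |r| > 0.7 screening threshold and the variance inflation caused by collinearity), the following sketch flags strongly correlated predictor pairs and computes variance inflation factors. It is not the simulation design used in the study: the function names `correlated_pairs` and `variance_inflation_factors`, the predictors `x1` to `x3` and the simulated data are all hypothetical, and it assumes only NumPy. The VIF is taken from the diagonal of the inverse correlation matrix, which equals 1 / (1 - R_j^2) for predictor j regressed on the others.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.7):
    """Return (name_i, name_j, r) for predictor pairs with |r| above threshold."""
    r = np.corrcoef(X, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                flagged.append((names[i], names[j], float(r[i, j])))
    return flagged

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    r = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(r))

# Hypothetical simulated predictors (an assumption for illustration only):
# x2 is built to correlate with x1 at about r = 0.9, x3 is independent.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
x3 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

flagged = correlated_pairs(X, names)   # pairs exceeding |r| > 0.7
vif = variance_inflation_factors(X)    # one VIF per predictor

for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.2f}")
for name, v in zip(names, vif):
    print(f"VIF({name}) = {v:.2f}")
```

In a threshold-based pre-selection, one variable of each flagged pair would be dropped before model fitting; which one to keep is a judgement call that the abstract suggests should draw on ecological understanding of the system.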
