Generalized additive models with principal component analysis: an application to time series of respiratory disease and air pollution data

Environmental epidemiological studies of the health effects of air pollution frequently use the generalized additive model (GAM) as the standard statistical methodology, with the ambient air pollutants as explanatory covariates. Although exposure to air pollutants is multi-dimensional, most of these studies include only a single pollutant as a covariate in the GAM. This restriction may arise because the pollutant variables are not only serially dependent but also interdependent. In an attempt to obtain a more realistic model, we propose here the hybrid generalized additive model–principal component analysis–vector auto-regressive (GAM–PCA–VAR) model, which combines PCA and GAMs with a VAR process. The PCA is used to eliminate the multicollinearity between the pollutants, whereas the VAR model is used to handle the serial correlation of the data, so that white noise processes enter the GAM as covariates. Some theoretical and simulation results for the proposed methodology are discussed, with special attention to the effect of the time correlation of the covariates on the PCA and, consequently, on the estimates of the parameters in the GAM and on the relative risk, which is a commonly used statistical quantity for measuring the effect of the covariates, especially the pollutants, on population health. As the main motivation for the methodology, a real data set is analysed with the aim of quantifying the association between respiratory disease and air pollution concentrations, especially particulate matter PM10, sulphur dioxide, nitrogen dioxide, carbon monoxide and ozone. The empirical results show that the GAM–PCA–VAR model can remove the auto-correlations from the principal components. In addition, this method produces estimates of the relative risk, for each pollutant, which are not affected by the serial correlation in the data. This, in general, leads to more pronounced values of the estimated risk compared with the standard GAM, indicating, for this study, an increase of almost 5.4% in the risk associated with PM10, one of the most important pollutants and one that is usually associated with adverse effects on human health.
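
The filtering idea described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes three simulated "pollutant" series generated by a VAR(1) process with made-up coefficients, a VAR(1) filter fitted by ordinary least squares, and PCA computed from the sample covariance matrix. The ordering (VAR filter first, PCA on the residuals second) and all names (`Phi`, `Sigma`, `pca_scores`, `lag1_autocorr`) are assumptions made for illustration. The GAM step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 3

# Simulate k serially dependent, cross-correlated series from a VAR(1):
# X_t = Phi X_{t-1} + e_t (hypothetical coefficients, not from the paper).
Phi = np.array([[0.6, 0.1, 0.0],
                [0.1, 0.5, 0.1],
                [0.0, 0.1, 0.7]])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
e = rng.multivariate_normal(np.zeros(k), Sigma, size=n)
X = np.zeros((n, k))
for t in range(1, n):
    X[t] = Phi @ X[t - 1] + e[t]


def lag1_autocorr(z):
    """Lag-1 sample autocorrelation of each column of z."""
    z = z - z.mean(axis=0)
    return (z[1:] * z[:-1]).sum(axis=0) / (z * z).sum(axis=0)


def pca_scores(data):
    """Principal component scores from the sample covariance matrix."""
    centered = data - data.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigval)[::-1]  # largest variance first
    return centered @ eigvec[:, order]


# Step 1 (VAR filter): least-squares VAR(1) fit with intercept; keep residuals.
Y, Z = X[1:], X[:-1]
design = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
resid = Y - design @ coef

# Step 2 (PCA): components of the raw series versus the filtered residuals.
pcs_raw = pca_scores(X)
pcs_filtered = pca_scores(resid)

acf_raw = lag1_autocorr(pcs_raw)
acf_filtered = lag1_autocorr(pcs_filtered)
corr_filtered = np.corrcoef(pcs_filtered, rowvar=False)

print("lag-1 autocorrelation, PCs of raw series:     ", np.round(acf_raw, 3))
print("lag-1 autocorrelation, PCs of VAR residuals:  ", np.round(acf_filtered, 3))
```

In this simulation the components of the raw series inherit the serial dependence of the data, while the components of the VAR residuals are mutually uncorrelated and close to white noise. In the approach described above, scores of the second kind would then enter the GAM as covariates.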
