pg-Causality: Identifying Spatiotemporal Causal Pathways for Air Pollutants with Urban Big Data

Many countries are suffering from severe air pollution. Understanding how different air pollutants accumulate and propagate is critical to making relevant public policies. In this paper, we use urban big data (air quality data and meteorological data) to identify the spatiotemporal (ST) causal pathways for air pollutants. This problem is challenging because: (1) the raw air quality data contain numerous noisy and low-pollution periods, which may lead to unreliable causality analysis; (2) for large-scale data in the ST space, the computational complexity of constructing a causal structure is very high; and (3) the ST causal pathways are complex due to the interactions of multiple pollutants and the influence of environmental factors. Therefore, we present pg-Causality, a novel pattern-aided graphical causality analysis approach that combines the strengths of pattern mining and Bayesian learning to efficiently identify the ST causal pathways. First, pattern mining suppresses the noise by capturing frequent evolving patterns (FEPs) of each monitoring sensor, and greatly reduces the complexity by selecting the pattern-matched sensors as “causers”. Then, Bayesian learning carefully encodes the local and ST causal relations with a Gaussian Bayesian Network (GBN)-based graphical model, which also integrates environmental influences to minimize biases in the final results. We evaluate our approach with three real-world data sets containing 982 air quality sensors in 128 cities, in three regions of China from 01-Jun-2013 to 31-Dec-2016. Results show that our approach outperforms traditional causal structure learning methods in time efficiency, inference accuracy and interpretability.
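
To make the GBN idea concrete, here is a minimal sketch on synthetic data; it is not the authors' implementation. It assumes each node of a Gaussian Bayesian Network is a linear-Gaussian function of its parents, and it ranks candidate "causer" sensors by how much adding their lagged readings improves the BIC of the target sensor's model. The function names (`fit_linear_gaussian`, `rank_candidate_causers`), the sensor names, the one-step lag, and the BIC score are all illustrative assumptions. FEP mining and environmental factors are left out entirely.

```python
import numpy as np


def fit_linear_gaussian(y, X):
    """Least-squares fit of y ~ N(b0 + X @ b, sigma2); returns (coef, sigma2, log_lik)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = float(resid @ resid) / n             # maximum-likelihood noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return coef, sigma2, log_lik


def bic(log_lik, n_params, n):
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_lik + n_params * np.log(n)


def rank_candidate_causers(target, candidates, lag=1):
    """Score each candidate's lagged series as an extra parent of the target node."""
    y = target[lag:]
    own_past = target[:-lag]
    n = len(y)
    # Baseline: the target depends only on its own past (intercept, slope, variance).
    _, _, base_ll = fit_linear_gaussian(y, own_past[:, None])
    base_bic = bic(base_ll, 3, n)
    scores = {}
    for name, series in candidates.items():
        X = np.column_stack([own_past, series[:-lag]])
        _, _, ll = fit_linear_gaussian(y, X)
        # Positive score: the lagged candidate improves BIC over the baseline.
        scores[name] = base_bic - bic(ll, 4, n)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic example (assumed data): "upwind" drives the target with a one-step lag.
rng = np.random.default_rng(0)
T = 500
upwind = rng.normal(size=T)
unrelated = rng.normal(size=T)
target = np.zeros(T)
for t in range(1, T):
    target[t] = 0.5 * target[t - 1] + 0.8 * upwind[t - 1] + 0.1 * rng.normal()

ranking = rank_candidate_causers(target, {"upwind": upwind, "unrelated": unrelated})
print(ranking)
```

In this toy setting the lagged "upwind" series should rank first, since it is the only candidate that enters the data-generating equation. Restricting `candidates` to a small pre-selected set loosely mirrors the role the abstract gives to pattern-matched sensors in reducing the search space.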
