Understanding Spatio-Temporal Urban Processes

Increasingly, decisions are based on insights and conclusions derived from the results of data analysis, so determining the validity of these results is of paramount importance. In this paper, we take a step towards helping users identify potential issues in spatio-temporal data and thus gain trust in the results they derive from these data. We focus on processes that are captured by relationships among datasets that serve as the data exhaust for different components of urban environments. In this scenario, debugging data involves two important challenges: the inherent complexity of spatio-temporal data and the large number of possible relationships. We propose a framework for profiling spatio-temporal relationships that automatically identifies data slices that deviate significantly from what is expected. This helps focus a user's attention on slices of the data that may have quality issues or that may affect the conclusions derived from the analysis. We describe the profiling methodology and how it derives relationships, identifies candidate deviations, assesses their statistical significance, and measures their magnitude. We also present a series of case studies using real datasets from New York City that demonstrate the usefulness of spatio-temporal profiling for building trust in the results of data analysis.
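The four profiling steps named above (deriving a relationship, identifying candidate deviations, assessing their significance, and measuring their magnitude) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework described in the paper: the relationship is taken to be a Pearson correlation per slice, the "expected" value is assumed to be the median correlation across slices, significance uses a Fisher z approximation with Benjamini-Hochberg correction, and all function names (`pearson`, `fisher_p_value`, `benjamini_hochberg`, `profile_slices`) are hypothetical.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r_slice, r_expected, n):
    """Two-sided p-value for r_slice differing from r_expected.

    Assumption: Fisher z-transform with standard error 1/sqrt(n - 3),
    treating r_expected as a fixed reference value. Requires n > 3.
    """
    def clamp(r):
        return max(min(r, 0.999999), -0.999999)  # keep atanh finite

    z = (math.atanh(clamp(r_slice)) - math.atanh(clamp(r_expected))) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


def profile_slices(slices, alpha=0.05):
    """Flag slices whose correlation deviates from the expected one.

    `slices` maps a slice name to a pair (x, y) of equal-length sequences.
    Returns (name, correlation, magnitude, p_value) for each flagged slice.
    """
    names = list(slices)
    rs = [pearson(*slices[name]) for name in names]
    expected = statistics.median(rs)  # assumed notion of "what is expected"
    ps = [fisher_p_value(r, expected, len(slices[name][0]))
          for name, r in zip(names, rs)]
    flags = benjamini_hochberg(ps, alpha)
    return [(name, r, abs(r - expected), p)
            for name, r, p, flagged in zip(names, rs, ps, flags) if flagged]
```

A slice here might be one day of the week or one neighborhood, with `x` and `y` holding two attributes measured over the same time steps. Using the median as the expectation keeps a single anomalous slice from shifting the baseline. A real profiler would need a more careful model of the expected behavior.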
