Open data products: A framework for creating valuable analysis-ready data

This paper develops the notion of an “open data product”. We define an open data product as the open result of the processes through which a variety of data (open and not) are turned into accessible information through a service, infrastructure, analytics or a combination of these, where each step of development is designed to promote open principles. Open data products are born out of a (data) need and add value beyond simply publishing existing datasets. We argue that the process of adding value should adhere to the principles of open (geographic) data science, ensuring openness, transparency and reproducibility. We also contend that outreach, in the form of active communication and dissemination through dashboards, software and publications, is key to engaging end-users and ensuring societal impact. Open data products have two major benefits. First, they enable insights from highly sensitive, controlled and/or secure data which may not be accessible otherwise. Second, they can expand the use of commercial and administrative data for the public good by leveraging their high temporal frequency and geographic granularity. We further contend that there is a compelling need for open data products as we experience the current data revolution. New, emerging data sources are unprecedented in temporal frequency and geographical resolution, but they are large, unstructured, fragmented and often hard to access due to privacy and confidentiality concerns. Transforming raw (open or “closed”) data into ready-to-use open data products allows new dimensions of human geographical processes to be captured and analysed, as we illustrate with existing examples. We conclude by arguing that several parallels exist between the role that open source software played in enabling research on spatial analysis in the 1990s and early 2000s, and the opportunities that open data products offer to unlock the potential of new forms of (geo-)data.
