The representativeness and spatial bias of volunteered geographic information: a review

ABSTRACT Many applications of volunteered geographic information (VGI) involve inferring the properties of an underlying population from a sample of VGI observations, i.e. a VGI sample. The representativeness of a VGI sample is therefore crucial in deciding whether VGI is fit for use in such applications. Because volunteers’ observation efforts are opportunistic, the spatial distribution of VGI observations is often biased (i.e. spatial bias). This bias degrades the representativeness of VGI and undermines the quality of inferences drawn from it. Extensive research has been conducted on assessing or assuring VGI quality from the perspective of the fundamental dimensions of spatial data quality. Yet this perspective alone provides limited insight into the representativeness of VGI. Assessing VGI representativeness and developing novel approaches to account for spatial bias in VGI are needed to broaden the spectrum of VGI applications. This article offers a comprehensive survey of the scientific literature from various domains (ecology, statistics, machine learning, etc.) to summarize existing work on sample representativeness assessment and sample selection bias correction, in order to inform the treatment of these issues in VGI applications.
