Making Big Data Small: Strategies to Expand Urban and Geographical Research Using Social Media

ABSTRACT While exciting, Big Data (particularly geotagged social media data) has proven difficult for many urbanists and social science researchers to use. As a partial solution, we propose a strategy that enables the fast extraction of only the relevant data from large sets of geosocial data. Although this runs contrary to many Big Data approaches, in which analysis is done on the entire dataset, much productive social science work can use smaller datasets, around the same size as census or survey data, within standard methodological frameworks. The approach we outline in this paper, including the example of a fully operating system, offers a solution for urban researchers who are interested in these types of data but reluctant to build data science skills themselves.
