Media Cloud: Massive Open Source Collection of Global News on the Open Web

We present the first full description of Media Cloud, an open source platform based on crawling hyperlink structure that has been in operation for over 10 years and that, for many uses, will be the best way to collect data for studying the media ecosystem on the open web. We document the key choices behind what data Media Cloud collects and stores, how it processes and organizes these data, and its open API access as well as its user-facing tools. We also highlight the strengths and limitations of the Media Cloud collection strategy compared to relevant alternatives. We give an overview of two sample datasets generated using Media Cloud and discuss how researchers can use the platform to create their own datasets.
