What You Can Scrape and What Is Right to Scrape: A Proposal for a Tool to Collect Public Facebook Data

In reaction to the Cambridge Analytica scandal, Facebook has restricted access to its Application Programming Interface (API). This new policy has made it harder for independent researchers to study relevant topics in political and social behavior. Yet much of the public information that researchers may be interested in is still available on Facebook and can still be collected systematically through web scraping techniques. The goal of this article is twofold. First, we discuss some ethical and legal issues that researchers should consider as they plan the collection and possible publication of Facebook data. In particular, we discuss what kind of information about users can be ethically gathered (public information), what published data should look like to comply with privacy regulations (such as the GDPR), and what consequences violating Facebook’s terms of service may entail for the researcher. Second, we present a scraping routine for public Facebook posts and discuss some technical adjustments that can make the data ethically and legally acceptable. The code employs screen scraping to collect the list of reactions to a public Facebook post, and applies a one-way cryptographic hash function to the users’ identifiers to pseudonymize their personal information while still keeping them traceable within the data. This article contributes to the debate around freedom of internet research and the ethical concerns that might arise from scraping data from the social web.
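The pseudonymization step described above can be illustrated with a minimal Python sketch. It is not the article's own code: the function name `pseudonymize`, the record layout, the sample identifiers, and the use of a keyed hash (HMAC-SHA256) with a researcher-held secret are all assumptions made for illustration. The abstract only specifies a one-way cryptographic hash applied to user identifiers.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a one-way pseudonym for a user identifier.

    Assumption: a keyed hash (HMAC-SHA256) is used so that the mapping
    cannot be rebuilt by hashing guessed identifiers without the key.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical scraped reactions; the field names are illustrative only.
scraped_reactions = [
    {"user_id": "user.one", "reaction": "LIKE"},
    {"user_id": "user.two", "reaction": "ANGRY"},
    {"user_id": "user.one", "reaction": "LOVE"},
]

# Hypothetical secret; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-with-a-long-random-secret"

# Replace each raw identifier with its pseudonym before storing the data.
pseudonymized_reactions = [
    {
        "user": pseudonymize(row["user_id"], SECRET_KEY),
        "reaction": row["reaction"],
    }
    for row in scraped_reactions
]
```

Because the same identifier always maps to the same pseudonym, a user's reactions remain linkable across posts within the dataset, while the raw identifier never appears in the stored records.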
