A qualitative and quantitative comparison between Web scraping and API methods for Twitter credibility analysis

Purpose: This paper presents an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. This is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of the two extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both data extraction alternatives and implements a previously proposed credibility model, by adding a qualitative evaluation and a performance analysis of the Twitter Application Programming Interface (API) from different locations.

Design/methodology/approach: As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the information shared on it. To do so, several proposals use either the Twitter API or Web scraping to extract the data for analysis. Qualitative and quantitative evaluations are performed to identify the advantages and disadvantages of both extraction methods.

Findings: The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights many more problems in this area that must be addressed in the pursuit of true transparency and legitimacy of information on the Web.

Originality/value: The results show that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Concerning time performance, Web scraping is faster than the Twitter API and more flexible in terms of obtaining data; however, it is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
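
The abstract does not describe the normalization process, so the following is only a minimal sketch of what such a step might look like, assuming that API and scraped text differ mainly in HTML entities, Unicode variants and whitespace. The function name `normalize_tweet_text` and the sample strings are hypothetical and are not taken from the paper or its framework.

```python
import html
import re
import unicodedata


def normalize_tweet_text(text: str) -> str:
    """Hypothetical normalization; not the authors' actual procedure."""
    # Decode HTML entities such as "&amp;", which may appear in one source only.
    text = html.unescape(text)
    # Fold compatibility characters (e.g. non-breaking spaces) into canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace, including newlines, into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Assumed examples of the same tweet as returned by two extraction methods.
api_text = "Breaking &amp; verified:  storm   update\n#news"
scraped_text = "Breaking & verified:\u00a0storm update #news"

same = normalize_tweet_text(api_text) == normalize_tweet_text(scraped_text)
print(same)
```

Under these assumptions, both variants reduce to the same string, so any text-based credibility score computed afterwards would receive identical input regardless of the extraction method.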
