The Many Dimensions of Truthfulness: Crowdsourcing Misinformation Assessments on a Multidimensional Scale

Recent work has demonstrated the viability of using crowdsourcing as a tool for evaluating the truthfulness of public statements. Under certain conditions, namely (1) having a balanced set of workers with different backgrounds and cognitive abilities, (2) using an adequate set of mechanisms to control the quality of the collected data, and (3) using a coarse-grained assessment scale, the crowd can reliably identify fake news. However, fake news is a subtle matter: statements can be merely biased (“cherry-picked”), imprecise, plainly wrong, and so on, and the unidimensional truthfulness scale used in existing work cannot account for such differences. In this paper we propose a multidimensional notion of truthfulness and ask crowd workers to assess seven dimensions of truthfulness, selected on the basis of the existing literature: Correctness, Neutrality, Comprehensibility, Precision, Completeness, Speaker’s Trustworthiness, and Informativeness. We deploy a set of quality control mechanisms, including a custom search engine that the crowd workers use to find web pages supporting their truthfulness assessments, to ensure that the thousands of judgments collected on 180 publicly available, fact-checked statements drawn from two datasets are of adequate quality. A comprehensive analysis of the crowdsourced judgments shows that (1) the crowdsourced assessments are reliable when compared to an expert-provided gold standard; (2) the proposed dimensions of truthfulness capture independent pieces of information; (3) the crowdsourcing task is easily learned by the workers; and (4) the resulting assessments provide a useful basis for a more complete estimation of statement truthfulness.
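
Purely as an illustration of the kind of analysis the abstract refers to, the sketch below shows how findings (1) and (2) could be checked from a table of raw crowd judgments: aggregate the per-statement scores, correlate the aggregated Correctness scores with an expert gold standard, and compute pairwise correlations between the seven dimensions. The file names, column names, median aggregation, and the use of Spearman's rank correlation are assumptions made for the example, not the authors' documented pipeline.

```python
# Minimal sketch (not the paper's actual analysis code) of two checks described
# in the abstract: agreement of aggregated crowd scores with an expert gold
# standard, and pairwise correlation between the seven truthfulness dimensions.
import pandas as pd
from scipy.stats import spearmanr

# Dimension names follow the abstract; the short "Trustworthiness" label and all
# file/column names below are illustrative assumptions.
DIMENSIONS = [
    "Correctness", "Neutrality", "Comprehensibility", "Precision",
    "Completeness", "Trustworthiness", "Informativeness",
]

def aggregate_judgments(raw: pd.DataFrame) -> pd.DataFrame:
    """Collapse individual worker scores into one score per statement and
    dimension; the median is one common, outlier-robust choice."""
    return raw.groupby("statement_id")[DIMENSIONS].median()

def crowd_vs_expert(agg: pd.DataFrame, gold: pd.Series) -> float:
    """Rank correlation between aggregated crowd Correctness scores and
    expert labels mapped to an ordinal scale."""
    joined = agg.join(gold.rename("gold"), how="inner")
    rho, _ = spearmanr(joined["Correctness"], joined["gold"])
    return rho

def dimension_correlations(agg: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Spearman correlations between dimensions; low off-diagonal
    values support the claim that dimensions carry independent information."""
    return agg.corr(method="spearman")

if __name__ == "__main__":
    # Hypothetical inputs: one row per (statement, worker) with Likert scores,
    # and one expert label per statement.
    raw = pd.read_csv("crowd_judgments.csv")
    gold = pd.read_csv("expert_gold.csv", index_col="statement_id")["label"]
    agg = aggregate_judgments(raw)
    print("crowd vs. expert rho:", crowd_vs_expert(agg, gold))
    print(dimension_correlations(agg).round(2))
```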
