Measuring the Impact of Readability Features in Fake News Detection

The proliferation of fake news is a current issue that affects several important areas of society, such as politics, the economy, and health. In Natural Language Processing, recent initiatives have tried to detect fake news in different ways, ranging from language-based approaches to content-based verification. In such approaches, the choice of features for classifying fake and true news is one of the most important parts of the process. This paper presents a study on the impact of readability features on fake news detection for Brazilian Portuguese. The results show that such features are relevant to the task (achieving, on their own, up to 92% classification accuracy) and may improve previous classification results.
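
To make the notion of a readability feature concrete, the following is a minimal sketch of three simple surface measures (average sentence length, average word length, and type-token ratio) that could feed a fake/true news classifier. The function name `readability_features` and the choice of measures are illustrative assumptions; they are not the feature set or implementation used in the paper.

```python
import re

def readability_features(text):
    """Compute a few simple surface readability features (illustrative only)."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters, so accented Portuguese characters are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_length": 0.0,
                "avg_word_length": 0.0,
                "type_token_ratio": 0.0}
    return {
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Mean number of letters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words (lexical diversity).
        "type_token_ratio": len(set(words)) / len(words),
    }

features = readability_features("O gato dorme. O gato come peixe!")
```

Each text would yield one such feature vector, which could then be passed to any standard classifier alongside other linguistic features.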
