Misinformation Detection on YouTube Using Video Captions

Millions of people use platforms such as YouTube, Facebook, and Twitter, as well as other mass media. Because these platforms are so accessible, they are often used to establish a narrative, conduct propaganda, and disseminate misinformation. This work proposes an approach that uses state-of-the-art NLP techniques to extract features from video captions (subtitles). To evaluate our approach, we utilize a publicly accessible, labeled dataset for classifying videos as misinformation or not. The motivation behind exploring video captions stems from our analysis of video metadata: attributes such as the number of views, likes, dislikes, and comments are ineffective, since videos are hard to differentiate using this information alone. Using the caption dataset, the proposed models can classify videos into three classes (Misinformation, Debunking Misinformation, and Neutral) with an F1-score of 0.85 to 0.90. To emphasize the relevance of the misinformation class, we reformulate our classification problem as a two-class task: Misinformation vs. Others (Debunking Misinformation and Neutral). In our experiments, the proposed models can classify videos with an F1-score of 0.92 to 0.95 and an AUC-ROC of 0.78 to 0.90.
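
The abstract does not specify the models themselves, so the following is only a minimal, hypothetical sketch of the experimental setup it describes: caption-text classification in both the three-class and binary (Misinformation vs. Others) settings, evaluated with F1-score and AUC-ROC. It uses a simple TF-IDF plus logistic regression baseline rather than the authors' state-of-the-art NLP features, and the file name captions.csv and the column names "caption" and "label" are assumptions.

# Sketch of the two evaluation settings described in the abstract.
# Assumptions: a CSV with "caption" (subtitle text) and "label"
# (Misinformation / Debunking Misinformation / Neutral) columns.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("captions.csv")  # hypothetical caption dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["caption"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# Three-class setting: Misinformation, Debunking Misinformation, Neutral.
clf = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("3-class macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))

# Binary reformulation: Misinformation vs. Others.
y_train_bin = (y_train == "Misinformation").astype(int)
y_test_bin = (y_test == "Misinformation").astype(int)
bin_clf = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
bin_clf.fit(X_train, y_train_bin)
probs = bin_clf.predict_proba(X_test)[:, 1]
print("Binary F1:", f1_score(y_test_bin, bin_clf.predict(X_test)))
print("Binary AUC-ROC:", roc_auc_score(y_test_bin, probs))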
