A Retrospective Analysis of the Fake News Challenge Stance-Detection Task

The 2017 Fake News Challenge Stage 1 (FNC-1) shared task addressed stance classification as a crucial first step towards detecting fake news. To date, no in-depth analysis has critically discussed FNC-1’s experimental setup, reproduced its results, and drawn conclusions for next-generation stance classification methods. In this paper, we provide such an in-depth analysis for the three top-performing systems. We first find that FNC-1’s proposed evaluation metric favors the majority class, which can be classified easily, and thus overestimates the true discriminative power of the methods. We therefore propose a new F1-based metric, which yields a different system ranking. Next, we compare the features and architectures used, which leads to a novel feature-rich stacked LSTM model that performs on par with the best systems but is superior in predicting the minority classes. To assess the methods’ ability to generalize, we derive a new dataset and perform both in-domain and cross-domain experiments. Our qualitative and quantitative study helps to interpret the original FNC-1 scores and to understand which features improve performance and why. Our new dataset and all source code used in the reproduction study are publicly available for future research.
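To make the difference between the two evaluation schemes concrete, the sketch below contrasts the original relatedness-weighted FNC-1 score with a class-wise macro-averaged F1 on a small, heavily imbalanced toy label set. It is illustrative only: the 0.25/0.75 weighting, the four label names, and the choice of macro-averaging are assumptions based on the commonly cited FNC-1 setup, not the paper’s exact scorer. A baseline that always predicts the dominant "unrelated" class already collects a sizeable share of the weighted score, while its macro F1 stays low.

```python
# Illustrative sketch (not the official FNC-1 scorer): contrasts a
# relatedness-weighted FNC-1-style score with a macro-averaged F1,
# assuming the commonly cited 0.25/0.75 weighting and four labels.
from sklearn.metrics import f1_score

LABELS = ["agree", "disagree", "discuss", "unrelated"]
RELATED = {"agree", "disagree", "discuss"}

def fnc1_score(gold, pred):
    # 0.25 for a correct related/unrelated decision,
    # a further 0.75 for the exact stance of a related pair.
    score = 0.0
    for g, p in zip(gold, pred):
        if (g in RELATED) == (p in RELATED):
            score += 0.25
        if g in RELATED and g == p:
            score += 0.75
    return score

def macro_f1(gold, pred):
    # Unweighted mean of per-class F1: minority classes such as
    # "disagree" count as much as the dominant "unrelated" class.
    return f1_score(gold, pred, labels=LABELS, average="macro", zero_division=0)

# Toy example with the class imbalance typical of the FNC-1 corpus.
gold = ["unrelated"] * 7 + ["discuss", "agree", "disagree"]
majority = ["unrelated"] * 10  # always predicts the majority class
print(fnc1_score(gold, majority) / fnc1_score(gold, gold))  # relative score ~ 0.37
print(macro_f1(gold, majority))                             # macro F1 ~ 0.21
```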
