BanFakeNews: A Dataset for Detecting Fake News in Bangla

Given the damage that the rapid propagation of fake news can cause in sectors such as politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English, while low-resource languages remain out of focus. Yet the risks posed by fake and manipulative news are not confined to any one language. In this work, we propose an annotated dataset of ~50K news articles that can be used for building automated fake news detection systems for a low-resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state-of-the-art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural network-based methods. We expect this dataset to be a valuable resource for building technologies that prevent the spread of fake news and to contribute to research on low-resource languages.
