Clickbait Detection in Telugu: Overcoming NLP Challenges in Resource-Poor Languages using Benchmarked Techniques

Clickbait headlines have become pervasive on social media and news websites. Methods to identify clickbait have largely been developed for English; with the growing use of social media platforms in other languages, such methods are needed for those languages as well. In this work, we present an annotated clickbait dataset of 112,657 headlines that can be used to build an automated clickbait detection system for Telugu, a resource-poor language. Our contributions in this paper include (i) pre-training the latest language models, including RoBERTa, ALBERT, and ELECTRA, on a large Telugu corpus of 8,015,588 sentences that we collected, and (ii) data analysis and benchmarking of approaches ranging from hand-crafted features to state-of-the-art models. We show that the language models pre-trained on Telugu outperform the existing multilingual pre-trained models, viz. BERT-Multilingual-Cased [18], XLM-MLM [12], and XLM-R [28], on the clickbait detection task. On the full Telugu clickbait dataset of 112,657 samples, the Light Gradient Boosting Machine (LightGBM) [17] model achieves an F1-score of 0.94 on the clickbait class and a comparable 0.93 on the non-clickbait class. We open-source our dataset, pre-trained models, and code at https://github.com/subbareddy248/Clickbait-Resources.
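
To make the benchmarking setup concrete, the following is a minimal sketch of training a LightGBM classifier on headline text and reporting per-class F1-scores, matching how the results above are stated. The file name, column names, and character n-gram features are assumptions for illustration only; the paper's actual feature sets range from hand-crafted cues to pre-trained embeddings.

    # Minimal sketch: LightGBM clickbait classifier with per-class F1 reporting.
    # Hypothetical data file and columns ("headline", "label") are assumptions.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from lightgbm import LGBMClassifier

    df = pd.read_csv("telugu_clickbait.csv")  # hypothetical path

    # Character n-grams are a reasonable default feature for a morphologically
    # rich, resource-poor language like Telugu; the paper evaluates many others.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                                 max_features=50000)
    X = vectorizer.fit_transform(df["headline"])
    y = df["label"]  # 1 = clickbait, 0 = non-clickbait

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
    clf.fit(X_train, y_train)

    # Per-class F1, reported separately for clickbait and non-clickbait,
    # as in the results summarized above.
    print(classification_report(y_test, clf.predict(X_test),
                                target_names=["non-clickbait", "clickbait"]))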

[1] G. Loewenstein. The psychology of curiosity: A review and reinterpretation, 1994.

[2] Xing Zhou, et al. Real-Time News Certification System on Sina Weibo, 2015, WWW.

[3] Carlos Guestrin, et al. "Why Should I Trust You?": Explaining the Predictions of Any Classifier, 2016, ArXiv.

[4] Wenpeng Yin, et al. Learning Word Meta-Embeddings, 2016, ACL.

[5] Niloy Ganguly, et al. Stop Clickbait: Detecting and preventing clickbaits in online news media, 2016, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM).

[6] Matthias Hagen, et al. The Clickbait Challenge 2017: Towards a Regression Model for Clickbait Strength, 2018, ArXiv.

[7] Tie-Yan Liu, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017, NIPS.

[8] Benjamin Heinzerling, et al. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages, 2017, LREC.

[9] Tanmoy Chakraborty, et al. We Used Neural Networks to Detect Clickbaits: You Won't Believe What Happened Next!, 2016, ECIR.

[10] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[11] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[12] Guillaume Lample, et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[13] G. Beauchamp, et al. Time for some a priori thinking about post hoc testing, 2008.

[14] Peng Zhou, et al. Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling, 2016, COLING.

[15] Júlio Cesar dos Reis, et al. Breaking the News: First Impressions Matter on Online News, 2015, ICWSM.

[16] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[17] Chih-Jen Lin, et al. Dual coordinate descent methods for logistic regression and maximum entropy models, 2011, Machine Learning.

[18] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.

[20] Trang T. Le, et al. Scaling tree-based automated machine learning to biomedical big data with a feature set selector, 2019, Bioinformatics.

[21] Matthias Hagen, et al. Clickbait Detection, 2016, ECIR.

[22] Prakhar Biyani, et al. "8 Amazing Secrets for Getting More Clicks": Detecting Clickbaits in News Streams Using Article Informality, 2016, AAAI.

[23] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information, 2016, TACL.

[24] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, Journal of Artificial Intelligence Research.

[25] Naeemul Hassan, et al. Diving Deep into Clickbaits: Who Use Them to What Extents in Which Topics with What Effects?, 2017, ASONAM.

[26] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[27] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[28] Veselin Stoyanov, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.

[29] Sanja Fidler, et al. Skip-Thought Vectors, 2015, NIPS.

[30] Dhruv Khattar, et al. A Neural Clickbait Detection Engine, 2017, ArXiv.

[31] Ponnurangam Kumaraguru, et al. Detecting clickbaits using two-phase hybrid CNN-LSTM biterm model, 2020, Expert Systems with Applications.

[32] Rosa Andrie Asmara, et al. Study of hoax news detection using naïve Bayes classifier in Indonesian language, 2017, 11th International Conference on Information & Communication Technology and System (ICTS).

[33] Hae-Young Kim, et al. Analysis of variance (ANOVA) comparing means of more than two groups, 2014, Restorative Dentistry & Endodontics.

[34] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[35] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[36] Amol Agrawal, et al. Clickbait detection using deep learning, 2016, 2nd International Conference on Next Generation Computing Technologies (NGCT).