Thai Fake News Detection Based on Information Retrieval, Natural Language Processing and Machine Learning

Fake news is a serious problem in every society. It must be detected, and its sharing stopped, before it causes further damage to the country. Spotting fake news is challenging because of its dynamic nature. In this research, we propose a framework for robust Thai fake news detection. The framework comprises three main modules: information retrieval, natural language processing, and machine learning. The research has two phases: data collection and machine learning model building. In the data collection phase, we obtained data from Thai online news websites using a web crawler, and we analyzed the data with natural language processing techniques to extract useful features. In the model building phase, we compared several well-known classification models: Naïve Bayes, Logistic Regression, K-Nearest Neighbor, Multilayer Perceptron, Support Vector Machine, Decision Tree, Random Forest, Rule-Based Classifier, and Long Short-Term Memory. On the test set, Long Short-Term Memory performed best, and we deployed an automatic online fake news detection web application.
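
The feature extraction and classification steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only one of the compared model families (Naïve Bayes), written from scratch over bag-of-words counts. The `tokenize` helper, the `NaiveBayes` class, and the toy English headlines are all hypothetical stand-ins; a real Thai pipeline would need a word segmenter and crawled data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Placeholder tokenizer (assumption): Thai has no spaces between words,
    # so a real pipeline would call a word segmenter here, not str.split().
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, Laplace-smoothed."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for illustration only.
train_texts = [
    "miracle herb cures all disease overnight",
    "secret cure doctors hide from public",
    "ministry reports new vaccination schedule",
    "central bank announces interest rate decision",
]
train_labels = ["fake", "fake", "real", "real"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction_fake = model.predict("herb cures disease")
prediction_real = model.predict("bank announces new schedule")
```

A comparison study like the one in the abstract would train each candidate model on the same features and score all of them on a held-out test set; this sketch shows only the shape of a single train-and-predict step.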
