Ensemble Learning Approach on Indonesian Fake News Classification

News is information about a recently changed situation or a recent event. As a popular information medium, the internet has the power to spread not only real news but fake news as well. We propose an ensemble learning approach to Indonesian fake news classification, both to separate fake news from real news and to tackle the imbalanced data problem we face in the given dataset. Our experimental results show that the random forest classifier, used as the ensemble classifier, obtains an F1-score of 0.98 and is superior to multinomial naive Bayes and support vector machine, the non-ensemble classifiers, which achieve F1-scores of 0.43 and 0.74 respectively across 660 evaluation documents. We also compare our results against other research that uses the same data, and our approach achieves better results.
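Because the abstract reports F1-scores on an imbalanced dataset, a small illustration of why F1 is more informative than accuracy in that setting may help. The following is a minimal sketch in plain Python, not the authors' evaluation code: the label vectors, the 90/10 class split, and the helper names `f1_score` and `accuracy` are all hypothetical and chosen only for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced labels: 90 documents of class 0, 10 of class 1.
y_true = [0] * 90 + [1] * 10

# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 100

# A classifier with 2 false alarms that recovers 8 of the 10 minority documents.
y_better = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2

for name, y_pred in [("majority", y_majority), ("better", y_better)]:
    print(name, round(accuracy(y_true, y_pred), 2), round(f1_score(y_true, y_pred), 2))
```

In this toy setup the majority-class predictor still reaches 0.9 accuracy while its F1 on the minority class is 0, which is the kind of gap that makes F1 a more informative measure when classes are imbalanced.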
