An Evaluation of Preprocessing Steps and Tree-based Ensemble Machine Learning for Analysing Sentiment on Indonesian YouTube Comments

This study aims to find the best model for sentiment analysis of Indonesian YouTube video comments. The datasets were crawled from YouTube video comments about government services related to the COVID-19 pandemic in Indonesia. Two opinion datasets were obtained from two different domains, with different characteristics and errors. The problem is that comments on YouTube videos are highly unstructured and contain spelling errors, poor diction, and slang words. To address this, several preprocessing techniques were tested, including standard preprocessing such as stop word removal, slang word and emoticon conversion, and stemming. Features were extracted using the count vectorizer and TF-IDF methods. For model development, five types of models were tested: Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree, Random Forest, and Extra Tree classifier. The best result is a model with a maximum accuracy of 89.68%, obtained using a combination of standard preprocessing (converting emoticons and handling unstructured words), count vectorizer feature extraction, and the Extra Tree classifier. © 2020, World Academy of Research in Science and Engineering. All rights reserved.
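
The preprocessing and count-based feature extraction described above can be illustrated with a minimal sketch. It is not the study's implementation: the lookup tables `SLANG`, `EMOTICONS`, and `STOP_WORDS`, the helper names `preprocess` and `count_vectorize`, and the two sample comments are all hypothetical, and stemming is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical lookup tables with illustrative entries only;
# they are not the resources used in the study.
SLANG = {"gak": "tidak", "bgt": "banget", "yg": "yang"}
EMOTICONS = {":)": "senang", ":(": "sedih"}
STOP_WORDS = {"yang", "dan", "di"}


def preprocess(comment):
    """Lowercase, convert emoticons, normalise slang, drop stop words."""
    text = comment.lower()
    # Convert emoticons to words before punctuation is discarded.
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    tokens = re.findall(r"[a-z]+", text)
    tokens = [SLANG.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]


def count_vectorize(docs):
    """Return (vocabulary, count matrix) for a list of token lists."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Made-up sample comments, one positive and one negative.
comments = ["Pelayanan yg bagus bgt :)", "Gak jelas pelayanan di sini :("]
docs = [preprocess(c) for c in comments]
vocabulary, matrix = count_vectorize(docs)
```

Each row of `matrix` holds the term counts for one comment, ordered by `vocabulary`. In practice such a matrix would be passed to a classifier, for example scikit-learn's `ExtraTreesClassifier`, although the abstract does not state which library was used.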