The impact of training data set size on the accuracy of sentiment classification using the Naïve Bayes technique

This paper examines the impact of training data set size on the accuracy of sentiment classification using the Naïve Bayes technique. In this study, sentiments are classified into three categories: positive, negative, or neutral. Five training data set sizes are used: 5, 10, 25, 50, and 100 tweets. Five users are involved, and they are required to classify the sentiment of the tweets in each training data set based on a specific keyword. The users' classifications are then used as the input for Naïve Bayes training, and the trained classifier is applied to another 25 tweets. Subsequently, the users are asked to validate the sentiment classifications produced by the Naïve Bayes technique. Accuracy is measured as the number of correctly classified tweets divided by the total number of classified tweets. The resulting accuracy is 46% ± 15% for the 5-tweet training data set, 78% ± 16% for 10 tweets, 89% ± 14% for 25 tweets, 87% ± 11% for 50 tweets, and 79% ± 10% for 100 tweets.
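The procedure described above can be illustrated with a minimal sketch of a three-class Naïve Bayes classifier and the accuracy measure (correct classifications divided by total classified tweets). This is not the implementation used in the study: the class name `NaiveBayes`, the function `accuracy`, whitespace tokenization, and Laplace (add-one) smoothing are all assumptions made for illustration, and the example tweets are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Illustrative multinomial Naive Bayes with Laplace smoothing (assumed, not from the paper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumption)
        self.class_counts = Counter()            # number of training tweets per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # all words seen during training

    def fit(self, texts, labels):
        # Count labels and words from the user-labelled training tweets.
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior: fraction of training tweets carrying this label.
            score = math.log(self.class_counts[label] / total_docs)
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                # Smoothed log likelihood of the word given the label.
                score += math.log((count + self.alpha) / (n_words + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, validated):
    """Number of correct classifications divided by total classified tweets."""
    correct = sum(p == v for p, v in zip(predicted, validated))
    return correct / len(validated)


# Invented training set of 5 tweets, standing in for one user's labelled set.
train_texts = [
    "love this phone",
    "great phone battery",
    "hate this phone",
    "terrible battery life",
    "phone arrives today",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("love the battery"))
print(model.predict("hate terrible phone"))
```

In the study's setting, the predictions for the 25 held-out tweets would be compared against each user's validation, for example `accuracy(predictions, user_validated_labels)`, and the per-user accuracies would then be averaged to obtain the mean and spread reported for each training set size.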