A Large-Scale Sentiment Data Classification for Online Reviews Under Apache Spark

Abstract

Sentiment analysis of large-scale data has become increasingly important and has attracted many researchers, prompting them to adopt platforms and tools that can handle large volumes of data. In this paper, we present new evaluation experiments on sentiment analysis of a large-scale dataset of online customer reviews using the Apache Spark data processing system. We use Apache Spark's scalable machine learning library (MLlib) and apply three of its classification techniques: Naive Bayes, support vector machine, and logistic regression. The results are evaluated using the accuracy metric. Experimental results show that the support vector machine classifier outperforms the Naive Bayes and logistic regression classifiers.
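For readers unfamiliar with the evaluation criterion, accuracy is the fraction of test reviews whose predicted sentiment label matches the true label. The following is a minimal sketch in plain Python of how such a comparison between classifiers might be scored. It is not the Spark MLlib code used in the experiments; the function name `accuracy`, the placeholder classifier names, and all labels and predictions are hypothetical values invented for illustration, not results from this paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction sequences must have equal length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from two placeholder classifiers (illustrative only).
predictions = {
    "classifier_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "classifier_b": [1, 0, 1, 1, 0, 0, 1, 1],
}

# Score each classifier on the same test labels and pick the highest accuracy.
scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)
```

Scoring every classifier against the same held-out labels is what makes the accuracy values directly comparable.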