Enhanced Naive Bayes Classifier for real-time sentiment analysis with SparkR

Accurate and fast sentiment analysis of continuously generated data, such as Twitter messages, is important for providing real-time customized services to users. Although the Naive Bayes Classifier (NBC) is the most popular classifier for sentiment analysis, existing studies of it have assumed a single-server environment and are therefore inadequate for handling real-time stream data. In this paper, we propose a scheme that combines Laplace smoothing with a Binarized NBC to improve accuracy, and that employs SparkR to speed up processing through distributed and parallel execution. Simulations on the Sentiment140 dataset show that the proposed approach consistently achieves higher accuracy than the existing schemes, and that training in the SparkR environment is faster than in plain R.
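To make the two accuracy-related ingredients concrete, the following is a minimal single-process sketch of a binarized Naive Bayes classifier with Laplace (add-one) smoothing. It follows the standard textbook formulation, not necessarily the exact scheme of the paper, and it does not use SparkR or any distributed processing. The function names, the whitespace tokenizer, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_binarized_nb(docs, labels, alpha=1.0):
    """Train a binarized multinomial NB model (hypothetical helper).

    Binarization: each word is counted at most once per document.
    Laplace smoothing: alpha is added to every word count per class.
    """
    vocab = set()
    word_counts = {}
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        # set() removes repeated words within one document (binarization)
        tokens = set(text.lower().split())
        vocab.update(tokens)
        word_counts.setdefault(label, Counter()).update(tokens)

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c, n_c in doc_counts.items():
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Smoothed likelihood: never zero, even for words unseen in class c
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest log-posterior score."""
    # Binarize the input too, and ignore words outside the vocabulary
    tokens = set(text.lower().split()) & model["vocab"]
    scores = {
        c: model["log_prior"][c] + sum(model["log_lik"][c][w] for w in tokens)
        for c in model["log_prior"]
    }
    return max(scores, key=scores.get)


# Toy data (invented for illustration, not taken from Sentiment140)
docs = ["good good great", "great fun", "bad awful", "awful boring bad bad"]
labels = ["pos", "pos", "neg", "neg"]
model = train_binarized_nb(docs, labels)
```

With this toy model, `predict(model, "great good")` selects `"pos"` and `predict(model, "awful")` selects `"neg"`. Without smoothing, a word such as "bad" that never occurs in a positive document would have zero likelihood under the positive class and would veto that class outright; the added `alpha` keeps every likelihood strictly positive.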
