Paper information - Scalable sentiment classification for Big Data analysis using Naïve Bayes Classifier

A typical way to obtain valuable information from a message is to extract the sentiment or opinion it expresses. Machine learning techniques are widely used in sentiment classification because they can “learn” from a training dataset and then predict, or support decision making, with relatively high accuracy. However, some algorithms do not scale well when the dataset is large. In this paper, we evaluate the scalability of the Naïve Bayes classifier (NBC) on large datasets. Instead of using a standard library (e.g., Mahout), we implemented NBC ourselves to achieve fine-grained control of the analysis procedure. A Big Data analysis system was also designed for this study. The results are encouraging: the accuracy of NBC improves as the dataset size increases, approaching 82%. We demonstrate that NBC can scale up to analyze the sentiment of millions of movie reviews with increasing throughput.
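To make the classification step concrete, here is a minimal single-machine sketch of a Naïve Bayes sentiment classifier. It is not the authors' implementation, which the abstract does not describe in detail. The multinomial word-count model, add-one (Laplace) smoothing, whitespace tokenization, the function names `train` and `predict`, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count documents per label and words per label.

    docs is an iterable of (text, label) pairs. Tokenization is a plain
    lowercase whitespace split (an assumption, not the paper's method).
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Log prior: fraction of training documents with this label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen (word, label) pairs non-zero.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, only to exercise the functions above.
toy_reviews = [
    ("a great and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull and boring film", "neg"),
    ("boring story and terrible acting", "neg"),
]
model = train(toy_reviews)
print(predict(model, "great story"))
print(predict(model, "boring and dull"))
```

Training here only accumulates per-label counts, and counts from separate data partitions can be added together, which is one reason this kind of classifier is a natural candidate for large datasets.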
