Improvement of Sentiment Analysis Based on Clustering of Word2Vec Features

Recently, many researchers have shown interest in using Word2Vec features for text classification tasks such as sentiment analysis. Word2Vec's ability to model high-quality distributional semantics among words has contributed to its success in many of these tasks. However, the high dimensionality of Word2Vec features increases the complexity of the classifier. In this paper, a method for constructing a feature set based on Word2Vec is proposed for sentiment analysis. The method clusters the terms in the vocabulary based on a set of opinion words from a sentiment lexicon, and the feature set for classification is then constructed from the resulting clusters. The effectiveness of the proposed method is evaluated on the Internet Movie Review Dataset with two classifiers, Support Vector Machine and Logistic Regression. The results are promising, showing that the proposed method can be more effective than the baseline approaches.
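The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one plausible reading: each vocabulary term is assigned to the opinion word whose vector is most similar to it by cosine similarity, and a document is then represented by how many of its terms fall into each cluster. The toy two-dimensional vectors, the seed words, and the helper names (`assign_clusters`, `document_features`) are all hypothetical; real Word2Vec embeddings and the paper's actual clustering rule may differ.

```python
import math
from collections import Counter

# Toy 2-d vectors standing in for trained Word2Vec embeddings (made-up values).
EMBEDDINGS = {
    "good": (0.9, 0.1),
    "great": (0.8, 0.2),
    "enjoyable": (0.7, 0.3),
    "bad": (0.1, 0.9),
    "awful": (0.2, 0.8),
    "boring": (0.3, 0.7),
}

# Opinion words from a sentiment lexicon, used here as cluster seeds (assumed).
OPINION_WORDS = ["good", "bad"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_clusters(embeddings, seeds):
    """Map each vocabulary term to the index of its most similar seed word."""
    clusters = {}
    for term, vec in embeddings.items():
        sims = [cosine(vec, embeddings[seed]) for seed in seeds]
        clusters[term] = sims.index(max(sims))
    return clusters


def document_features(tokens, clusters, n_clusters):
    """Count how many tokens of a document fall into each cluster."""
    counts = Counter(clusters[t] for t in tokens if t in clusters)
    return [counts.get(i, 0) for i in range(n_clusters)]


clusters = assign_clusters(EMBEDDINGS, OPINION_WORDS)
tokens = "a great but boring and awful film".split()
features = document_features(tokens, clusters, len(OPINION_WORDS))
print(features)  # → [1, 2]
```

Under this reading, the feature vector has one dimension per cluster rather than one per embedding dimension or vocabulary term, and it could be passed to a classifier such as a Support Vector Machine or Logistic Regression.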
