An approach to the use of word embeddings in an opinion classification task

Highlights:
- Vector-based word representations can help improve a document classifier.
- The information in word2vec vectors and bags of words is highly complementary.
- The combination of word2vec and bag-of-words representations obtains the best results.
- word2vec is much more stable than bag-of-words models in cross-domain experiments.

In this paper we show how a vector-based word representation obtained via word2vec can help to improve the results of a document classifier based on bags of words. Both models produce numeric representations of texts, but they do so very differently. The bag-of-words model represents documents as highly sparse vectors whose indices correspond to words or groups of words. word2vec generates word-level representations as much more compact vectors, whose components implicitly encode information about the contexts in which words occur. Bags of words are very effective for document classification, and in our experiments no representation using only word2vec vectors is able to improve on their results. However, this does not mean that the information provided by word2vec is not useful for the classification task. When this information is combined with the bags of words, the results improve, showing its complementarity and its contribution to the task. We have also performed cross-domain experiments in which word2vec has shown much more stable behavior than bag-of-words models.
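A minimal sketch of the kind of combined representation the abstract describes, using scikit-learn for the bag-of-words features and gensim for word2vec (both libraries appear in the paper's tooling). The toy corpus, the vector size, and the choice of averaging word vectors into a single document vector are illustrative assumptions, not the authors' exact setup.

# Sketch: combining sparse bag-of-words features with dense word2vec
# features for document classification. Corpus, hyperparameters, and
# the mean-of-word-vectors composition are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labeled corpus (1 = positive opinion, 0 = negative).
docs = [
    "the movie was great and moving",
    "terrible plot and poor acting",
    "a great cast and a moving story",
    "the plot was terrible",
]
labels = [1, 0, 1, 0]
tokenized = [d.split() for d in docs]

# Sparse, high-dimensional bag-of-words vectors (TF-IDF weighted).
bow = TfidfVectorizer()
X_bow = bow.fit_transform(docs).toarray()

# Compact word2vec vectors; a document is represented here as the
# mean of the vectors of the words it contains.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def doc_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_w2v = np.vstack([doc_vector(t, w2v) for t in tokenized])

# Concatenate the two views: each document keeps both the sparse
# lexical features and the dense contextual features.
X = np.hstack([X_bow, X_w2v])
clf = LinearSVC().fit(X, labels)

Averaging word vectors is only one simple way to compose a document vector from word2vec; the point of the sketch is the concatenation step, which lets the classifier exploit both representations at once rather than choosing between them.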
