Multilingual Sentiment Analysis Using Latent Semantic Indexing and Machine Learning

We present a novel approach to predicting the sentiment of documents in multiple languages without translation. The only prerequisite is a multilingual parallel corpus in which a training sample of the documents, in a single language only, has been tagged with its overall sentiment. Latent Semantic Indexing (LSI) converts that corpus into a multilingual "concept space". Both training and test documents can be projected into that space, which allows semantic comparisons between documents in different languages. The training documents with known sentiment are then used to build a machine learning model which, because the document projections are language-independent, can predict sentiment in the other languages. We explain this approach and evaluate its accuracy. We also design and conduct experiments to investigate the extent to which topic and sentiment separately contribute to classification accuracy, and thereby shed some initial light on the question of whether topic and sentiment can be sensibly teased apart.
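The pipeline described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the six-document English/Spanish toy corpus, the language-prefixed tokens, the choice of K = 2 concepts, the standard LSI fold-in projection, and the nearest-centroid classifier, which stands in for whatever machine learning model is actually used.

```python
import numpy as np

# Hypothetical toy parallel corpus: (English text, Spanish translation).
PARALLEL = [
    ("good great movie", "buena gran pelicula"),
    ("great happy story", "gran feliz historia"),
    ("bad awful movie", "mala horrible pelicula"),
    ("awful sad story", "horrible triste historia"),
    ("good happy", "buena feliz"),
    ("bad sad", "mala triste"),
]
# Sentiment tags are known for the English side only (1 = positive, 0 = negative).
LABELS = np.array([1, 1, 0, 0, 1, 0])


def tokens(text, lang):
    # Prefix each token with its language so identical spellings stay distinct.
    return [f"{lang}:{w}" for w in text.lower().split()]


# Shared vocabulary covering both languages.
vocab = sorted({t for en, es in PARALLEL
                for t in tokens(en, "en") + tokens(es, "es")})
index = {t: i for i, t in enumerate(vocab)}


def term_vector(toks):
    # Raw term counts over the multilingual vocabulary; unknown terms are skipped.
    v = np.zeros(len(vocab))
    for t in toks:
        if t in index:
            v[index[t]] += 1.0
    return v


# Term-by-document matrix: each column concatenates all translations of one document.
X = np.column_stack([term_vector(tokens(en, "en") + tokens(es, "es"))
                     for en, es in PARALLEL])

# Truncated SVD gives the K-dimensional multilingual concept space.
K = 2  # assumed number of concepts for this toy example
U, s, _ = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :K], s[:K]


def project(text, lang):
    # Fold a monolingual document into the concept space: d^T U_k S_k^{-1}.
    return term_vector(tokens(text, lang)) @ U_k / s_k


# Train on English projections only, using one centroid per sentiment class.
train = np.array([project(en, "en") for en, _ in PARALLEL])
centroids = {label: train[LABELS == label].mean(axis=0) for label in (0, 1)}


def predict(text, lang):
    # Assign the label of the nearest class centroid (Euclidean distance).
    z = project(text, lang)
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))


# Spanish test documents are classified without any translation step.
print(predict("buena feliz pelicula", "es"))   # 1
print(predict("mala triste historia", "es"))   # 0
```

In this toy corpus the Spanish terms co-occur with their English counterparts in every parallel document, so a document and its translation project to the same point, which is why a model trained on English projections transfers to Spanish. Real corpora will not align this cleanly, and the number of concepts and the classifier would need to be chosen empirically.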
