Sentiment analysis as a text categorization task: A study on feature and algorithm selection for Italian language

The availability on the Internet of huge amounts of blog posts, messages and comments makes it possible to study people's attitudes toward various topics. Sentiment Analysis, Opinion Mining and Emotion Analysis denote the area of Computer Science research aimed at studying, analyzing and classifying text documents according to the opinions their authors express on various topics. This is a hard task, because it depends on psychological aspects that are not always evident in the lexical and syntactic features of sentences, yet it can be of paramount importance for applications such as market analysis and political polls. Fundamental pre-processing techniques for this task come from Natural Language Processing, which poses additional problems when the language of interest is not English, since fewer (or less reliable) resources are available to extract the needed data from the text. This paper studies the performance of Sentiment Analysis, seen as a Text Categorization task, as a function of the classifiers and features used. While the approach is general, we focus on texts in Italian. The outcomes suggest which experimental settings can be used most profitably in this setting, and show that good results can be obtained.
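
To make the framing concrete, here is a minimal sketch of sentiment analysis treated as text categorization. It is not the paper's experimental setup: the multinomial Naive Bayes classifier, the word n-gram features, the whitespace tokenizer, the names `extract_features` and `NaiveBayes`, and the four-sentence Italian training set are all assumptions chosen for illustration. The `n_max` parameter stands in for the kind of feature choice the study compares.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, n_max=1):
    """Lowercase, split on whitespace, and return word n-grams for n = 1..n_max."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, n_max=1):
        self.n_max = n_max
        self.class_counts = Counter(labels)      # documents per class
        self.feat_counts = defaultdict(Counter)  # feature counts per class
        for doc, label in zip(docs, labels):
            self.feat_counts[label].update(extract_features(doc, n_max))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in extract_features(doc, self.n_max):
                if feat in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_docs = [
    "ottimo prodotto davvero consigliato",
    "servizio pessimo non consigliato",
    "film bellissimo e divertente",
    "esperienza terribile e noiosa",
]
train_labels = ["pos", "neg", "pos", "neg"]

unigram_model = NaiveBayes().fit(train_docs, train_labels, n_max=1)
bigram_model = NaiveBayes().fit(train_docs, train_labels, n_max=2)

print(unigram_model.predict("prodotto ottimo"))
print(bigram_model.predict("servizio terribile"))
```

Swapping the classifier class or changing `n_max` while keeping the data fixed mirrors, at toy scale, the classifier-by-feature comparison the abstract describes. A real experiment would also need Italian-specific pre-processing (stopword removal, stemming) and a held-out evaluation set.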
