Reprint of "Supervised sentiment analysis in Czech social media"

We explore state-of-the-art supervised machine learning methods for sentiment analysis of Czech social media. We provide a large human-annotated Czech social media corpus. We explore different pre-processing techniques and employ various features and classifiers. We experiment with five different feature selection algorithms. Results are also reported on other widely popular domains, such as movie and product reviews.

This article describes in-depth research on machine learning methods for sentiment analysis of Czech social media. Whereas in English, Chinese, or Spanish this field has a long history and evaluation datasets for various domains are widely available, in the case of the Czech language no systematic research has yet been conducted. We tackle this issue and establish a common ground for further research by providing a large human-annotated Czech social media corpus. Furthermore, we evaluate state-of-the-art supervised machine learning methods for sentiment analysis. We explore different pre-processing techniques and employ various features and classifiers. We also experiment with five different feature selection algorithms and investigate the influence of named entity recognition and preprocessing on sentiment classification performance. Moreover, in addition to our newly created social media dataset, we also report results for other popular domains, such as movie and product reviews. We believe that this article will not only extend the current sentiment analysis research to another family of languages, but will also encourage competition, potentially leading to the production of high-end commercial solutions.
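
The abstract describes a pipeline of feature extraction, feature selection, and supervised classification without naming the specific algorithms. As a rough illustration only, the sketch below combines bag-of-words features, chi-square feature selection, and a multinomial Naive Bayes classifier with add-one smoothing. The choice of chi-square and Naive Bayes is an assumption made for this example, not a statement of the article's configuration, and the function names (`select_features`, `train_nb`, `predict_nb`) and the toy English sentences are hypothetical placeholders rather than samples from the Czech corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization with lowercasing; a real system for Czech
    # would likely add stemming or lemmatization (assumption).
    return text.lower().split()


def chi_square_scores(docs, labels):
    """Score each word by its maximum chi-square statistic over all classes."""
    n = len(docs)
    classes = set(labels)
    doc_sets = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for w in vocab:
        best = 0.0
        for c in classes:
            # 2x2 contingency table: word present/absent vs. class c/other.
            a = sum(1 for s, y in zip(doc_sets, labels) if w in s and y == c)
            b = sum(1 for s, y in zip(doc_sets, labels) if w in s and y != c)
            c_abs = sum(1 for s, y in zip(doc_sets, labels) if w not in s and y == c)
            d = n - a - b - c_abs
            denom = (a + c_abs) * (b + d) * (a + b) * (c_abs + d)
            if denom:
                best = max(best, n * (a * d - b * c_abs) ** 2 / denom)
        scores[w] = best
    return scores


def select_features(docs, labels, k):
    """Keep the k highest-scoring words (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:k])


def train_nb(docs, labels, features):
    """Collect class priors and per-class counts of the selected words."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, y in zip(docs, labels):
        word_counts[y].update(t for t in tokenize(doc) if t in features)
    return class_counts, word_counts, features


def predict_nb(model, text):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, word_counts, features = model
    total = sum(class_counts.values())
    best_class, best_lp = None, -math.inf
    for c in sorted(class_counts):
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(features)
        for t in tokenize(text):
            if t in features:
                lp += math.log((word_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best_class, best_lp = c, lp
    return best_class


# Invented toy data for illustration only.
train_docs = [
    "great film loved it",
    "loved the great acting",
    "terrible film hated it",
    "hated the terrible plot",
]
train_labels = ["pos", "pos", "neg", "neg"]

features = select_features(train_docs, train_labels, k=4)
model = train_nb(train_docs, train_labels, features)
print(predict_nb(model, "loved the acting"))
```

In this toy setting, feature selection discards words such as "film" and "the" that occur equally often in both classes, so the classifier relies only on the polarity-bearing words.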
