Sentiment classification of documents in Serbian: The effects of morphological normalization and word embeddings

Two open issues in the sentiment classification of texts written in Serbian are the effect of different forms of morphological normalization and the usefulness of leveraging large amounts of unlabeled text. In this paper, we assess the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset. We also consider the effectiveness of using word embeddings, generated from a large unlabeled corpus, as classification features.
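To make the two ideas concrete, the following is a minimal illustrative sketch, not the method evaluated in the paper. It assumes a hypothetical toy suffix list standing in for a real Serbian stemmer, and a hypothetical two-dimensional embedding table standing in for vectors trained on a large unlabeled corpus. It shows how normalized tokens can be mapped to embeddings and averaged into a fixed-length document feature vector that a classifier could consume.

```python
from typing import Dict, List, Sequence

# Hypothetical toy suffix list (an assumption for illustration only);
# real Serbian stemmers rely on much larger, linguistically motivated rule sets.
TOY_SUFFIXES = ("ovima", "ima", "om", "a", "e", "i", "u")


def toy_stem(token: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching toy suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def average_embedding(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of all tokens found in the embedding table.

    Tokens without a vector are skipped; a document with no known tokens
    yields the zero vector.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        known += 1
        for i, value in enumerate(vector):
            total[i] += value
    if known == 0:
        return total
    return [value / known for value in total]


# Hypothetical embedding table keyed by stems (assumed values, not real vectors).
toy_embeddings = {"film": [1.0, 0.0], "dobar": [0.0, 1.0]}

# Different inflected forms collapse to one stem, so they share one vector.
document = ["dobar", "filmom"]
stems = [toy_stem(token) for token in document]
features = average_embedding(stems, toy_embeddings, dim=2)
print(stems, features)
```

In this sketch the normalization step matters because the embedding table is keyed by stems: without it, an inflected form such as "filmom" would have no vector and would contribute nothing to the document features.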
