Sentiment classification of documents in Serbian: The effects of morphological normalization

Sentiment classification of texts written in Serbian is still an under-researched topic. One open issue is how different forms of morphological normalization affect the performance of different sentiment classifiers, and which normalization procedure is optimal for this task. In this paper, we assess and compare the impact of lemmatizers and stemmers for Serbian on classifiers trained and evaluated on the Serbian Movie Review Dataset.
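
To illustrate why morphological normalization matters for bag-of-words features, the following minimal sketch collapses several inflected forms of one Serbian noun into a single feature. It is not one of the stemmers or lemmatizers evaluated in the paper: the suffix list, the minimum stem length, and the names `toy_stem` and `bag_of_words` are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical suffix list for illustration only (longest suffixes first).
# This is NOT the rule set of any stemmer evaluated in the paper.
TOY_SUFFIXES = ("ovima", "ovi", "ova", "om", "a", "u", "e", "i")
MIN_STEM_LEN = 3  # assumed lower bound so that short words are not over-stripped


def toy_stem(token):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


def bag_of_words(text, normalize=None):
    """Count whitespace-separated tokens, optionally normalizing each one first."""
    tokens = text.lower().split()
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)


# Five inflected forms of "film" (movie).
review = "film filma filmu filmovi filmova"
raw = bag_of_words(review)
stemmed = bag_of_words(review, normalize=toy_stem)

print("distinct features without normalization:", len(raw))
print("distinct features with toy stemming:", len(stemmed))
```

Without normalization, each inflected form becomes its own sparse feature; after normalization, the forms share one feature and its counts are pooled, which is the effect on classifier input that the comparison of normalization procedures examines.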
