FinnSentiment: a Finnish social media corpus for sentiment polarity annotation

Sentiment analysis and opinion mining are important tasks with obvious applications in social media, e.g. in detecting hate speech and fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently with sentiment polarity by three native-speaker annotators. The same three annotators labelled the whole data set, which provides a unique opportunity for further studies of annotator behaviour over time. We analyse their inter-annotator agreement and provide two baselines to validate the usefulness of the data set.
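The abstract does not say which agreement measure is used, so the following is only an illustrative sketch of how agreement between three annotators could be quantified with pairwise Cohen's kappa. The annotator names, the label strings and the toy sentences are assumptions made for the example and do not come from the data set.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of nominal labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of items that received the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappa(annotations):
    """Kappa for every pair of annotators in a {name: labels} mapping."""
    return {
        (p, q): cohen_kappa(annotations[p], annotations[q])
        for p, q in combinations(sorted(annotations), 2)
    }


# Hypothetical toy data: three annotators, four sentences.
toy = {
    "A": ["pos", "pos", "neg", "neg"],
    "B": ["pos", "neg", "pos", "neg"],
    "C": ["pos", "pos", "neg", "neg"],
}
scores = pairwise_kappa(toy)
```

With three annotators this yields three pairwise scores. A single chance-corrected figure over all annotators at once would need a different measure, such as Fleiss' kappa or Krippendorff's alpha.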
