GU-ISS-2019-03 Assessing the quality of Språkbanken's annotations

Most of the corpora in Språkbanken Text consist of unannotated plain text, such as almost all newspaper texts, social media texts, novels and official documents. We also have some corpora that are manually annotated in different ways, such as Talbanken (annotated for part of speech and syntactic structure) and the Stockholm-Umeå Corpus (annotated for part of speech). Språkbanken's annotation pipeline Sparv aims to automate the annotation of all our corpora while keeping the manual annotations intact. Once annotated, the corpora can be made available, e.g., in the corpus search tools Korp and Strix. Until now there has been no comprehensive overview of the annotation tools and models that Sparv has been using for the last eight years. Some of them have not been updated since the start, such as the part-of-speech tagger Hunpos and the dependency parser MaltParser. There are also annotation tools that we have not yet included, such as a constituency-based parser. Språkbanken therefore initiated a project with the aim of producing such an overview. This document is the outcome of that project: it describes the types of manual and automatic annotations that we currently have in Språkbanken, and gives an incomplete overview of the state of the art with regard to annotation tools and models.
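
To make the idea of "keeping the manual annotations intact" concrete, the following is a minimal, hypothetical Python sketch. It is not Sparv's actual API; the function and variable names are assumptions introduced purely for illustration. It shows one way an automatic annotation layer can be merged with a manual one so that manual labels always take precedence over the pipeline's output.

```python
# Minimal illustrative sketch (not Sparv's actual API): merge an automatic
# token-level annotation layer with a manual one, never overwriting a
# manual label. All names here are assumptions for illustration only.

from typing import Optional


def merge_annotations(
    manual: list[Optional[str]],
    automatic: list[str],
) -> list[str]:
    """Prefer the manual label when one exists; otherwise fall back to the
    label produced automatically by the pipeline."""
    if len(manual) != len(automatic):
        raise ValueError("annotation layers must cover the same tokens")
    return [m if m is not None else a for m, a in zip(manual, automatic)]


if __name__ == "__main__":
    # Three tokens: only the second one carries a manual part-of-speech tag.
    manual_pos = [None, "NN", None]
    automatic_pos = ["DT", "VB", "VB"]
    print(merge_annotations(manual_pos, automatic_pos))  # ['DT', 'NN', 'VB']
```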
