Paper information - Syntactic Annotation of Slovene CMC: First Steps

Syntactic Annotation of Slovene CMC: First Steps

This paper presents the first steps towards the syntactic annotation of Slovene CMC, namely the annotation of 200 Slovene tweets with the JOS dependency model. After introducing the dataset, we describe the selected annotation model, the annotation procedure, and the results. The paper focuses on the decisions regarding the annotation of CMC-specific elements that required special treatment: Twitter-specific features, foreign language elements, ellipsis and fragments, non-standard use of punctuation, and other non-standard language features. The dataset, together with the CMC-adapted annotation guidelines, can be used for further annotation of language data (from Twitter or other CMC genres) and, in a second step, to train a parser for the selected CMC domain(s). The large-scale corpus-based research of non-standard Slovene syntax, which will be facilitated by the described activities, will help disprove the myths surrounding CMC that are still present in the field of Slovene studies.

Simon Krek | Špela Arhar Holdt | Tomaž Erjavec | Darja Fišer
