Closing a gap in the language resources landscape: Groundwork and best practices from projects on computer-mediated communication in four European countries.

The paper presents best practices and results from projects in four countries dedicated to creating corpora of computer-mediated communication and social media interactions (CMC). Although many issues in building and annotating such corpora remain open, a range of accessible solutions has already been tested in projects. These solutions may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.
