Closing a gap in the language resources landscape : Groundwork and best practices from projects on computer-mediated communication in four European countries.

The paper presents best practices and results from projects in four countries dedicated to creating corpora of computer-mediated communication and social media interactions (CMC). Although many issues in building and annotating corpora of this type remain open, there is already a range of accessible solutions that have been tested in projects and that may serve as a starting point for a more precise discussion of how future standards for CMC corpora may (and should) be shaped.

Tomaž Erjavec | Darja Fišer | Ciara R. Wigham | Nikola Ljubešić | Harald Lüngen | Thierry Chanier | Egon W. Stemle | Céline Poudat | Angelika Storrer | Michael Beißwenger | Axel Herold | Isabella Chiari

[1] Wessel Stoop, et al. Collecting Facebook Posts and WhatsApp Chats - Corpus Compilation of Private Social Media Messages, 2016, TSD.

[2] C. M. Sperberg-McQueen, et al. Guidelines for electronic text encoding and interchange, 1994.

[3] Stefan Thater, et al. Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication, 2014, KONVENS.

[4] Ciara R. Wigham, et al. Interactions between text chat and audio modalities for L2 communication and feedback in the synthetic world Second Life, 2015.

[5] Eric N. Forsyth. Improving automated lexical and discourse analysis of online chat dialog, 2007.

[6] Harald Lüngen, et al. Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN, 2016, KONVENS.

[7] Tomaž Erjavec, et al. Omogočanje dostopa do korpusov slovenskih spletnih besedil v luči pravnih omejitev, 2016.

[8] Marie-Josée Hamel, et al. Language-Learner Computer Interactions: Theory, methodology and CALL applications, 2016.

[9] Paul Rayson, et al. Children Online: A survey of child language and CMC corpora, 2012.

[10] Tomaž Erjavec, et al. Normalising Slovene data: historical texts vs. user-generated content, 2016, KONVENS.

[11] Iryna Gurevych, et al. WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations, 2013, ACL.

[12] Elisabeth Stark, et al. sms4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland, 2011.

[13] Eliza Margaretha, et al. Building Linguistic Corpora from Wikipedia Articles and Discussions, 2014, J. Lang. Technol. Comput. Linguistics.

[14] Angelika Storrer, et al. A TEI Schema for the Representation of Computer-mediated Communication, 2012.

[16] Jennifer-Carmen Frey, et al. The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts, 2016, CLiC-it/EVALITA.

[17] Stefan Evert, et al. EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora, 2016, WAC@ACL.

[18] Tomaž Erjavec, et al. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages, 2011, Language Resources and Evaluation.

[19] Craig H. Martell, et al. Lexical and Discourse Analysis of Online Chat Dialog, 2007, International Conference on Semantic Computing (ICSC 2007).

[20] Adam Kilgarriff, et al. The Sketch Engine: ten years on, 2014.

[21] Tomaž Erjavec, et al. JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin, 2016.

[22] Nelleke Oostdijk, et al. The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch, 2013, Essential Speech and Language Technology for Dutch.

[23] Tomaž Erjavec, et al. Corpus-Based Diacritic Restoration for South Slavic Languages, 2016, LREC.

[24] Angelika Storrer, et al. DeRiK: A German reference corpus of computer-mediated communication, 2013, Lit. Linguistic Comput.

[25] Tomaž Erjavec, et al. Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene, 2016, LREC.

[27] Brook Bolander, et al. Doing sociolinguistic research on computer-mediated data: a review of four methodological issues, 2014.

[28] Tomaž Erjavec, et al. The IMP historical Slovene language resources, 2015, Lang. Resour. Evaluation.

[29] Swantje Westpfahl, et al. FOLK-Gold - A Gold Standard for Part-of-Speech-Tagging of Spoken German, 2016, LREC.

[30] Natalia Grabar, et al. Wikiconflits : un corpus de discussions éditoriales conflictuelles du Wikipédia francophone, 2017.

[31] Harald Lüngen, et al. Building and Annotating a Corpus of German-Language Newsgroups, 2015.

[32] Angelika Storrer, et al. Corpora of computer-mediated communication, 2008.

[33] Benoît Sagot, et al. The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres, 2014, J. Lang. Technol. Comput. Linguistics.