Challenges of Annotating a Code-Switching Treebank

This paper presents challenges and observations from creating a code-switching treebank, based on the ongoing annotation of a Turkish–German spoken corpus following the Universal Dependencies annotation scheme. We discuss a number of issues that arise from the need for consistent multilingual annotation within a single treebank, and from the informal language in which code-switching is most often observed. Besides proposing solutions to these issues, we aim to stimulate discussion and facilitate consistency across upcoming code-switching annotation projects.
