The KiezDeutsch Korpus (KiDKo) Release 1.0

This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.
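As an illustration of how a tagger ensemble can point to error patterns, the following minimal sketch flags tokens on which several taggers disagree and counts which tags compete with each other. It is not the authors' implementation: the function names, the toy sentence, and the three tagger outputs are hypothetical, and the tags are simply assumed to follow the STTS tagset.

```python
from collections import Counter


def find_disagreements(tokens, tagger_outputs):
    """Return (index, token, tags, majority_tag) for every token on which
    the taggers do not all agree. `tagger_outputs` holds one tag sequence
    per tagger, each aligned with `tokens`."""
    flagged = []
    for i, token in enumerate(tokens):
        tags = tuple(output[i] for output in tagger_outputs)
        counts = Counter(tags)
        if len(counts) > 1:
            # Ties are resolved in favour of the tag seen first.
            majority_tag = counts.most_common(1)[0][0]
            flagged.append((i, token, tags, majority_tag))
    return flagged


def confusion_patterns(flagged):
    """Count how often each set of competing tags occurs."""
    return Counter(tuple(sorted(set(tags))) for _, _, tags, _ in flagged)


# Hypothetical example: one utterance, three (made-up) tagger outputs.
tokens = ["ich", "geh", "so", "Kino"]
outputs = [
    ["PPER", "VVFIN", "ADV", "NN"],
    ["PPER", "VVFIN", "ADV", "NE"],
    ["PPER", "VVIMP", "ADV", "NN"],
]

for index, token, tags, majority in find_disagreements(tokens, outputs):
    print(index, token, tags, "majority:", majority)
```

Tokens flagged this way are candidates for manual inspection, and frequent tag pairs in `confusion_patterns` hint at systematic confusions rather than one-off mistakes.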
