FOLK-Gold ― A Gold Standard for Part-of-Speech-Tagging of Spoken German

In this paper, we present a gold standard of part-of-speech tagged transcripts of spoken German. The gold standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of spoken German and an extended version of the STTS (Stuttgart-Tübingen Tagset) that accounts for phenomena typically found in spontaneous spoken German. The gold standard was developed on the basis of the Research and Teaching Corpus of Spoken German (FOLK) and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for spoken German.
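To make the four annotation layers concrete, the following is a minimal Python sketch of how a single token could be represented. It is not the format in which the corpus is distributed: the `Token` class, its field names, the helper `pos_sequence`, and the example utterance are all hypothetical and chosen for illustration only. The tags shown (PPER, VAFIN, NN) are ordinary STTS tags.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One token carrying the four annotation layers (hypothetical field names)."""
    transcription: str   # modified orthography, as transcribed
    normalization: str   # standard orthography
    lemma: str           # lemmatization layer
    pos: str             # part-of-speech tag (STTS-style)


def pos_sequence(tokens: List[Token]) -> List[str]:
    """Return only the POS layer of an utterance, in token order."""
    return [t.pos for t in tokens]


# Illustrative utterance, invented for this sketch (not taken from the corpus).
utterance = [
    Token("wir", "wir", "wir", "PPER"),
    Token("ham", "haben", "haben", "VAFIN"),
    Token("zeit", "Zeit", "Zeit", "NN"),
]

if __name__ == "__main__":
    for tok in utterance:
        print(tok.transcription, tok.normalization, tok.lemma, tok.pos, sep="\t")
```

Keeping the layers side by side on each token means that a colloquial transcribed form such as "ham" stays linked to its normalized form, its lemma and its tag, which is what allows a tagger to be trained or evaluated on any one layer.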
