Paper information - Annotated Corpus of Polish Spoken Dialogues

Annotated Corpus of Polish Spoken Dialogues

The paper presents a corpus of Polish spoken dialogues produced within the LUNA (spoken Language UNderstanding in multilinguAl communication systems) project. We describe how the corpus was collected and how it was annotated on several levels, from transcription of the dialogues and their morphosyntactic analysis to semantic annotation at the level of concepts and predicates. Annotation on the morphosyntactic and semantic levels was performed automatically and then corrected manually. At the concept level, the annotation scheme comprises about 200 concepts from an ontology designed specifically for the project. The set of frames for predicate-level annotation was defined as a FrameNet-like resource.

Malgorzata Marciniak | Ryszard Gubrynowicz | Krzysztof Marasek | Agnieszka Mykowiecka | Joanna Rabiega-Wisniewska
