Paper information - Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers

Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers

We describe a general two-stage procedure for re-using a custom corpus for spoken language system development. The first stage transforms character-based markup into XML; the second enhances the XML markup with multiple lexical tag trees, driven by DSSSL stylesheets. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
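The two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the original work uses DSSSL stylesheets and lexical tag trees, whereas this sketch swaps in plain Python with flat XML attributes. The input convention (`speaker: words`, with `<P>` for a pause), the element names `u`, `w` and `pause`, the toy `LEXICON`, and the functions `to_xml` and `tag_on_demand` are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: lower-cased word form -> lexical tag layers.
LEXICON = {
    "guten": {"pos": "ADJ", "lemma": "gut"},
    "tag": {"pos": "N", "lemma": "Tag"},
}


def to_xml(line):
    """Stage 1: turn a character-marked transcript line into an XML utterance.

    Assumed input convention (invented for this sketch): 'speaker: words',
    where the token '<P>' marks a pause.
    """
    speaker, _, text = line.partition(":")
    u = ET.Element("u", who=speaker.strip())
    for token in text.split():
        if token == "<P>":
            ET.SubElement(u, "pause")
        else:
            w = ET.SubElement(u, "w")
            w.text = token
    return u


def tag_on_demand(u, layers):
    """Stage 2: add only the requested lexical tag layers to each word."""
    for w in u.iter("w"):
        entry = LEXICON.get(w.text.lower(), {})
        for layer in layers:
            if layer in entry:
                w.set(layer, entry[layer])
    return u


utterance = to_xml("AB: guten <P> Tag")
tagged = tag_on_demand(utterance, layers=["pos"])
xml_text = ET.tostring(tagged, encoding="unicode")
print(xml_text)
```

Passing `layers=["pos", "lemma"]` would add both layers at once. Restricting the `layers` parameter is what makes the second stage behave like a parametrised filter rather than a one-off full tagging pass.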
