The Spoken Dutch Corpus. Overview and First Evaluation

This paper presents the Spoken Dutch Corpus project, a joint Flemish-Dutch undertaking aimed at compiling and annotating a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in computational linguistics and language and speech technology. The paper first gives an overall description of the project: its aims, structure and organization. It then discusses the considerations, both methodological and practical, that have played a role in the design of the corpus as well as in its compilation and annotation. The paper concludes with an account of the data available in the first release of the first part of the corpus, which came out on March 1st, 2000.
