A Corpus of Dutch Aphasic Speech: Sketching the Design and Performing a Pilot Study

This thesis presents a pilot study for the development of a corpus of Dutch aphasic speech (CoDAS). Given the lack of resources of this kind, not only for Dutch but also for other languages, CoDAS can help set standards and contribute to future research in this area. A corpus of Dutch aphasic speech should fulfill at least three requirements. First, it should contain a plausible sample of contemporary Dutch as spoken by aphasic patients; that is, it should include speech representing different types of aphasia as well as various communication settings. Second, the speech fragments should be documented with relevant metadata, including information about the speaker and the aphasia. Third, the corpus should be enriched with various kinds of linguistic information. Given the special character of the speech contained in CoDAS, we cannot simply carry over the design and annotation protocols of existing corpora, such as the Spoken Dutch Corpus (SDC) or CHILDES; they have, however, been taken as a starting point. In our pilot study, we have established the basic requirements that CoDAS should fulfill with respect to text types, metadata, and annotation levels. To this end, we have investigated whether and how the transcription and annotation procedures and protocols used for the SDC should be adapted in order to transcribe and annotate aphasic speech properly. In particular, we suggest improvements to the existing protocols for orthographic transcription and part-of-speech tagging. The phonetic transcription procedure used within the SDC, on the other hand, can be adopted without major modifications.
