Corpus linguistics and language documentation: challenges for collaboration

Recent literature in corpus linguistics (e.g., McEnery & Ostler 2000) and language documentation (e.g., Johnson 2004) suggests both disciplines may share natural points of interaction, having in common an interest in the construction and use of permanent collections of diverse linguistic data. Although considerable benefit might be anticipated from close collaboration between these two areas, divergences in their respective purposes, practices, and products may render such an interaction more difficult to foster than might initially be expected. This paper considers points of commonality and difference between corpus linguistics and language documentation in four specific areas of practice, drawing upon examples from ongoing corpus construction and language documentation efforts centered on Mennonite Plautdietsch in Canada. Given the results of this comparison, this study proposes viewing corpora as descriptive applications of language documentation, to be built directly upon the permanent documentary record. By founding corpora upon documentary materials, such an approach opens language documentation more readily to the analytical and methodological contributions of corpus linguistics, while providing a solid empirical basis for future corpus construction.

[1] David Nathan, et al. Thick Interfaces: Mobilizing Language Documentation with Multimedia, 2006.

[2] Steven Bird. Last Words: Natural Language Processing and Linguistic Fieldwork, 2009, CL.

[3] Ulrike Mosel, et al. Chapter 1 Language documentation: What is it and what is it good for?, 2006.

[4] R. Turner. Oriental and African Studies in Great Britain, 1948.

[5] Anke Lüdeling, et al. Corpus Linguistics: An International Handbook, 2009.

[6] Nikolaus P. Himmelmann, et al. Reproduction and Preservation of Linguistic Knowledge: Linguistics' Response to Language Endangerment, 2008.

[7] Susan Conrad, et al. Corpus Linguistics: Investigating Language Structure and Use, 1998.

[8] Ulrike Mosel, et al. Essentials of language documentation, 2006.

[9] Hans Rausing Endangered Languages, et al. Language Documentation and Description, 2003.

[10] C. M. Sperberg-McQueen, et al. Guidelines for electronic text encoding and interchange, 1994.

[11] Trevor Johnston, et al. From archive to corpus: transcription and annotation in the creation of signed language corpora, 2008, PACLIC.

[12] Wesley Y. Leonard, et al. Making "collaboration" collaborative: An examination of perspectives that frame linguistic field research, 2010.

[13] Guy Aston, et al. The BNC Handbook: Exploring the British National Corpus with SARA, 1998.

[14] Daan Broeder, et al. The IMDI Metadata set, its Tools and accessible Linguistic databases, 2001.

[15] G. Leech. Corpora and theories of linguistic performance, 1992.

[16] Francisco Torreira, et al. The Nijmegen Corpus of Casual Spanish, 2010, LREC.

[17] P. Lewis. Ethnologue: languages of the world, 2009.

[18] Nancy Ide, et al. GrAF: A Graph-based Format for Linguistic Annotations, 2007, LAW@ACL.

[19] Christopher Cox. Probabilistic tagging of minority language data: a case study using Qtag, 2010.

[20] Stefan Evert, et al. How Random is a Corpus? The Library Metaphor, 2006.

[21] Tony McEnery, et al. A new agenda for corpus linguistics - working with all of the world's languages, 2000.

[22] Reuben Epp. The Story of Low German & Plautdietsch: tracing a language across the globe, 1993.

[23] Jan Svartvik, et al. Directions in corpus linguistics: proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991, 1992.

[24] Steve Crowdy. Spoken Corpus Design, 1993.

[25] Στέλλα Μαρκαντωνάτου, et al. Proceedings of the Second International Conference on Language Resources and Evaluation, 2000.

[26] Christopher Cox, et al. Software demonstration: CuPED (Customizable Presentation of ELAN Documents), 2009.

[27] Martin Wynne, et al. Developing Linguistic Corpora: a Guide to Good Practice, 2005.

[28] N. Himmelmann, et al. Documentary and descriptive linguistics, 1998.

[29] Stefanie Wulff, et al. Corpus-linguistic applications: current studies, new directions, 2010.

[30] Laurent Romary, et al. International standard for a linguistic annotation framework, 2003, HLT-NAACL 2003.

[31] Gary Simons, et al. The OLAC Metadata Set and Controlled Vocabularies, 2001, ACL 2001.

[32] Michael Stubbs, et al. Words and Phrases: Corpus Studies of Lexical Semantics, 2001.

[33] T. Regehr. Mennonites in Canada, 1939-1970: A People Transformed, 1996.

[34] Fiona M. Douglas. The Scottish Corpus of Texts and Speech: Problems of Corpus Design, 2003, Lit. Linguistic Comput.

[35] Nancy Ide, et al. XCES: An XML-based Encoding Standard for Linguistic Corpora, 2000, LREC.