English Corpus Linguistics: Collecting and computerizing data

Once the basic outlines of a corpus are determined, it is time to begin the actual creation of the corpus. This is a three-part process involving the collection, computerization, and annotation of data. This chapter focuses on the first two parts of the process: how to collect and computerize data. The next chapter focuses in detail on the last part: the annotation of a corpus once it has been encoded in computer-readable form.

Collecting data involves recording speech, gathering written texts, obtaining permission from speakers and writers to use their texts, and keeping careful records of the texts collected and the individuals from whom they were obtained. How the collected data are computerized depends on whether they are spoken or written. Recordings of speech need to be transcribed manually, using either a special cassette tape recorder that can automatically replay segments of a recording or software that can do the equivalent with a sample of speech that has been converted into digital form. Written texts that are not available in electronic form can be computerized with an optical scanner and accompanying OCR (optical character recognition) software, or (less desirably) they can be retyped manually.

Although collecting, computerizing, and annotating texts are discussed as separate stages in this and the next chapter, in many respects the stages are closely connected. After a conversation is recorded, for instance, it may prove more efficient to transcribe it immediately, since whoever made the recording will be available to answer questions about it and to aid in its transcription.
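The record keeping and the spoken/written split described above can be illustrated with a small sketch. This is only one possible arrangement, not a scheme taken from the chapter: the `TextRecord` class, its field names, and the `next_step` helper are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """One collected text plus the bookkeeping about it (hypothetical fields)."""
    text_id: str
    mode: str                        # "spoken" or "written"
    contributor: str                 # speaker or writer the text came from
    permission_obtained: bool = False
    computerized: bool = False


def next_step(record: TextRecord) -> str:
    """Return the next stage of work for a text, following the order
    collection -> computerization -> annotation."""
    if not record.permission_obtained:
        return "obtain permission"
    if not record.computerized:
        # Speech is transcribed; written texts are scanned (OCR) or retyped.
        return "transcribe" if record.mode == "spoken" else "scan or retype"
    return "annotate"


if __name__ == "__main__":
    sample = TextRecord("S001", "spoken", "speaker-A", permission_obtained=True)
    print(sample.text_id, "->", next_step(sample))
```

Keeping the permission and computerization status alongside each text makes it easy to see which recordings are ready to transcribe while the person who made them is still available to help.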
