Abstract
Large databases of transcribed speech, downloadable from the Internet, are a corpus linguist's dream. They turn into a corpus linguist's nightmare, however, when the transcriptions are not linguistically accurate. In this paper I assess the suitability of the Hansard parliamentary transcripts (200 million words, downloadable) as a corpus linguistic resource by comparing a sample of the official transcript with a transcript made from a recording of a House of Commons session. As could be expected from earlier research, the official transcripts omit performance characteristics of spoken language, such as incomplete utterances or hesitations, as well as any type of extrafactual, contextual talk (e.g., about turn-taking). Beyond these omissions, the transcribers and editors also alter speakers' lexical and grammatical choices towards more conservative and formal variants. Linguists ought, therefore, to be cautious in their use of the Hansard transcripts and, generally, in the use of transcriptio...