The Hansard hazard: gauging the accuracy of British parliamentary transcripts

Abstract. Large databases of transcribed speech, downloadable from the Internet, are a corpus linguist's dream. They turn into a corpus linguist's nightmare, however, when the transcriptions are not linguistically accurate. In this paper I assess the suitability of the Hansard parliamentary transcripts (200 million words, downloadable) as a corpus-linguistic resource, comparing a sample of the official transcript to a transcript made from a recording of a House of Commons session. As could be expected from earlier research, the transcripts omit performance characteristics of spoken language, such as incomplete utterances or hesitations, as well as any type of extrafactual, contextual talk (e.g., about turn-taking). Beyond these expected omissions, however, the transcribers and editors also alter speakers' lexical and grammatical choices towards more conservative and formal variants. Linguists ought, therefore, to be cautious in their use of the Hansard transcripts and, generally, in the use of transcriptio...