'Virtual' Text Corpora and their Management

The extensive use of computer-based corpora for a range of language studies has led to a proliferation of the ways in which texts within an individual corpus are organised. Typically, the organisation reflects the immediate needs of a group of well-motivated users, such as lexicographers or terminologists. This means that subsequent generations of corpus users are forced to use a classification of texts based on categories they may be unfamiliar with, uncomfortable with, or both. There is an urgent need for a facility in corpus management systems that allows users to apply their own classification system to categorise the texts in a corpus. That is, users should be able to choose, for example, their own style, register, field, time span and author attributes for generating word lists, concordances, contextual examples, etc. A lexicography/terminology management system, System Quirk, is described that can support such a virtual organisation of texts within a corpus.

Introduction

There are open questions in corpus linguistics about how texts should be selected and, perhaps more importantly, for what purpose. Some argue that lexicographers and linguists should choose the texts themselves with some advice from teachers of English (Sinclair and colleagues in Sinclair 1987), whilst the corpus linguistics pioneers used a random-selection approach (cf. the Lancaster Oslo Bergen Corpus and the Brown Corpus). Still others have argued that there should be an equal mixture of deliberately selected and randomly selected text (see, for instance, Summers 1991).
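
Whichever selection policy is adopted, the notion of a 'virtual' corpus, in which users group texts by attributes of their own choosing, can be made concrete with a small example. The following Python fragment is a minimal sketch under stated assumptions and is not System Quirk's implementation: the names Text, virtual_corpus and word_list, the attribute keys and the three sample texts are all hypothetical, introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Text:
    """A corpus text together with user-assigned attributes (hypothetical)."""
    title: str
    body: str
    attributes: dict  # e.g. {"field": "law", "register": "report"}


def virtual_corpus(texts, **criteria):
    """Select the texts whose attributes match every user-supplied criterion.

    The stored corpus is left untouched; the selection is only a view on it.
    """
    return [
        t for t in texts
        if all(t.attributes.get(key) == value for key, value in criteria.items())
    ]


def word_list(texts):
    """Count lower-cased, whitespace-separated tokens over the selected texts."""
    counts = Counter()
    for t in texts:
        counts.update(t.body.lower().split())
    return counts


# Illustrative data only; not drawn from any real corpus.
corpus = [
    Text("a", "the valve controls the flow",
         {"field": "engineering", "register": "manual"}),
    Text("b", "the court heard the appeal",
         {"field": "law", "register": "report"}),
    Text("c", "flow rate depends on the valve",
         {"field": "engineering", "register": "paper"}),
]

# One user's view of the corpus: only the engineering texts.
engineering = virtual_corpus(corpus, field="engineering")
freq = word_list(engineering)
```

Another user could call virtual_corpus with different criteria, for instance register="report", and obtain a different word list from the same stored texts. Concordances and contextual examples could be generated over such a selection in the same way.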