THE CORPORA OF ESTONIAN AT THE UNIVERSITY OF TARTU: THE CURRENT SITUATION

This paper gives an overview of the corpus-related work done at the University of Tartu so far and describes an ongoing project: compiling a large corpus of written Estonian containing approximately 100 million words. The previously collected corpora of standard written Estonian at the University of Tartu are well balanced and representative, but somewhat too small for studying statistically infrequent phenomena in language, not to mention the needs of language technology. The corpus currently under compilation, called the Mixed Corpus of Estonian, is planned as an open monitor corpus, but will also contain a more balanced subcorpus. In addition to these corpora of standard written Estonian, the paper gives a very brief overview of the Corpus of Estonian Dialects, the Corpus of Spoken Estonian and the Corpus of Old Literary Estonian, and discusses some special annotated corpora in more detail, namely the morphologically annotated corpus and the Estonian-English parallel corpus of legislative texts.