Data in Your Language: the ECI Multilingual Corpus 1

In this paper we describe the contents and the method of production of the ACL European Corpus Initiative Multilingual Corpus 1 (ECI/MC1). This is a large multilingual electronic text corpus, containing 97 million words in 27 (mainly European) languages. It is available at low cost on CD-ROM. Most of the texts in the corpus are marked up using a fully validated SGML document type description based on the Text Encoding Initiative (TEI) guidelines for corpus annotation. We hope that this corpus will provide a useful resource for corpus-based computational linguistics.