Corpora and Corpus-Based Morpho-Lexical Processing

The term corpus, as used here, refers to a collection of spoken or written texts encoded in a specific machine-readable format. Corpora are used in language engineering to gather both qualitative and quantitative evidence of real language use. Qualitative evidence consists of examples that can be used in constructing computational lexicons, grammars, multi-lingual lexicons and term banks, in lexicography, etc. Quantitative evidence consists of statistics indicating frequent or characteristic uses of language. These statistics can be used to guide preference-based parsers, assist in lexicography, determine translation equivalents, etc. In addition, they can be used to drive morphological taggers, POS taggers, alignment programs, sense taggers, etc.
Common operations on corpora for the purposes of language engineering include extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, generation of lists of linguistic elements, etc.; and the determination of statistics such as frequency information, averages, mutual information scores, etc. We do not address corpora intended for other applications, such as stylistic studies, socio-linguistics, historical studies, information retrieval, etc., although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering).
The encoding format should be standardised and homogeneous, both to support interchange and to support open-ended retrieval tasks [1]. Restricting the scope to a single application domain enables development of a tighter standard than that of the TEI, by providing specific encoding solutions rather than general or multiple ones and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).
The texts are selected according to explicit criteria that reflect the main purpose the corpus is intended to serve (lexicographic tasks; terminological data extraction/acquisition; contrastive/comparative studies on parallel texts; grammar induction; machine translation; etc.). Based on several linguistic classification criteria, Sinclair defines in [1] a corpus typology (see also [2] in this volume). From that typology, we are concerned here with special corpora and parallel corpora.
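As an illustration of the corpus operations listed earlier in this section (frequency information, mutual information scores, concordance generation), the following minimal Python sketch may be helpful. It is not taken from any particular toolkit: the function names (tokenize, frequencies, bigram_pmi, concordance), the naive whitespace tokenizer, and the restriction of mutual information to adjacent word pairs are all assumptions made for this example. A real system would operate on the encoded corpus and its annotation rather than on raw strings.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def frequencies(tokens):
    """Raw frequency of each token in the corpus."""
    return Counter(tokens)


def bigram_pmi(tokens):
    """Pointwise mutual information, log2(P(x,y) / (P(x) * P(y))),
    for each pair of adjacent tokens, using relative-frequency estimates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def concordance(tokens, keyword, width=3):
    """Keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = "the corpus is encoded and the corpus is annotated"
    toks = tokenize(sample)
    print(frequencies(toks).most_common(3))
    print(concordance(toks, "encoded", width=2))
    for pair, score in sorted(bigram_pmi(toks).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

On a sample this small the mutual information scores carry no statistical weight; the sketch only shows the shape of the computation. In practice, low-frequency pairs are usually filtered out before ranking, since relative-frequency estimates for rare events are unreliable.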