Paper information - Bilingual terminology extraction dataset KAS-biterm 1.0

Bilingual terminology extraction dataset KAS-biterm 1.0

The KAS-biterm bilingual term extraction dataset contains complete sentences selected from PhD theses in the KAS corpus of Slovene academic writing. Only sentences with a high chance of containing a term in the original language together with its translation into Slovene were chosen, using three CQL patterns in NoSketch Engine. These sentences are manually annotated for (1) terms, (2) partial terms and (3) abbreviations in (a) Slovene, (b) English, or (c) another language. Links between the Slovene terms and their equivalents in the other languages, as well as their abbreviations, are also encoded. The resource can serve as a training set for supervised learning of bilingual term extraction tools and as a benchmark for them.

Tomaž Erjavec | Darja Fišer | Nikola Ljubešić | Maja Bitenc