Design and Annotation of MultiMedica – A Multilingual Text Corpus of the Biomedical Domain

Abstract. This article describes the MultiMedica corpus, a multilingual collection of Spanish, Japanese, and Arabic texts from the biomedical domain. This novel combination of languages was chosen for two purposes: the contrastive study of three typologically and genetically different languages, and the creation of a gold standard for developing and evaluating an Automatic Term Recognition (ATR) system. A total of 51,476 documents were collected from the Web, and the corpus contains over seven and a half million words. Most documents were written by medical doctors and edited by journalists for the general public. Each text has been tagged for part of speech and indexed in both an Information Retrieval system and a concordance interface aimed at students of Translation, Medicine, and Medical Humanities.