A multi-view approach to bilingual terminology extraction (Une approche multi-vue pour l'extraction terminologique bilingue)

ABSTRACT. This paper presents a multi-view approach to translating domain-specific terms, based on a bilingual lexicon and comparable corpora. We study three levels of representation for a term: its context, its theme, and its spelling. The three views are first evaluated individually, then combined to rank translation candidates. Experiments on translating medical terms from French into English show a significant improvement over the classical context-based approach, reaching a precision of 80.4% for the first-ranked translation candidate.
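To make the idea of combining views concrete, here is a minimal sketch of how three per-view scores could be merged to rank candidates. It is not the authors' implementation: the abstract does not say how the views are scored or combined, so the weighted sum, the equal default weights, the cosine similarity for the context and theme views, the normalized edit distance for the spelling view, and every name below (`rank_candidates`, `spelling_similarity`, the `context` and `topics` fields) are assumptions made for illustration. The source term's context vector is assumed to have already been projected into the target language with the bilingual lexicon.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def spelling_similarity(a, b):
    """Map edit distance to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def rank_candidates(source, candidates, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank target-language candidates by a weighted sum of the three views.

    `source` and each candidate are dicts with hypothetical keys:
    'term' (str), 'context' (dict), 'topics' (dict).
    """
    w_context, w_theme, w_spelling = weights
    scored = []
    for cand in candidates:
        score = (w_context * cosine(source["context"], cand["context"])
                 + w_theme * cosine(source["topics"], cand["topics"])
                 + w_spelling * spelling_similarity(source["term"], cand["term"]))
        scored.append((cand["term"], score))
    # Best candidate first; rank 1 is the top of this list.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


source = {"term": "hypertension",
          "context": {"blood": 1.0, "pressure": 1.0},
          "topics": {"t1": 0.9, "t2": 0.1}}
candidates = [
    {"term": "diabetes", "context": {"insulin": 1.0}, "topics": {"t2": 1.0}},
    {"term": "hypertension", "context": {"blood": 1.0, "pressure": 1.0},
     "topics": {"t1": 0.8, "t2": 0.2}},
]
ranking = rank_candidates(source, candidates)
```

In this toy example the candidate that agrees with the source term on all three views is ranked first. In practice the weights would need to be tuned on held-out term pairs, and a different combination scheme (for example, rank voting) could be substituted without changing the overall structure.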
