Approches quantitatives de l'extraction de ressources traductionnelles à partir de corpus parallèles (Quantitative Approaches to the Extraction of Translation Resources from Parallel Corpora)

This work presents the results of a series of experiments aimed at developing new tools for the intertextual textometric exploration of translation corpora. Several methods of textual statistics were adapted to a multilingual context and applied to parallel text processing, including repeated-segment extraction, computation of characteristic elements, bi-textual topography, multiple co-occurrences, factorial analysis, and automatic classification. Concrete applications illustrate the use of each method in a multilingual context. The examples are accompanied by sample translation resources obtained on a quantitative basis from the parallel French/English corpus of the Convention for the Protection of Human Rights. The proposed approach opens new possibilities for the automatic exploration of lexical equivalences in translation corpora by a variety of users: translators, foreign language teachers, terminologists, lexicographers, and others.
