Brains, not brawn: The use of “smart” comparable corpora in bilingual terminology mining

Current research in text mining favors the quantity of texts over their representativeness. For bilingual terminology mining, however, large comparable corpora are not available for many language pairs. More importantly, because terms are defined with respect to a specific domain with a restricted register, the representativeness of the corpus is expected to matter more than its size. Our hypothesis, therefore, is that the representativeness of the corpus is more important than its quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important discourse type is as a characteristic of comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one for each language, and an alignment program. We evaluated the candidate translations against a reference list and found that taking discourse type into account yielded candidate translations of better quality, even when the corpus size was reduced by half.
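
The evaluation step described above, scoring ranked candidate translations against a reference list, can be sketched as follows. This is a minimal illustration only: the abstract does not state which metric was used, so the top-n accuracy measure, the function name top_n_accuracy, the data layout, and the toy French-Japanese entries are all assumptions introduced here, not the authors' implementation or data.

```python
from typing import Dict, List, Set


def top_n_accuracy(
    candidates: Dict[str, List[str]],
    reference: Dict[str, Set[str]],
    n: int,
) -> float:
    """Fraction of reference source terms that have at least one
    correct translation among their first n ranked candidates."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not reference:
        return 0.0
    hits = 0
    for source_term, gold in reference.items():
        # Terms with no candidates at all count as misses.
        ranked = candidates.get(source_term, [])
        if any(c in gold for c in ranked[:n]):
            hits += 1
    return hits / len(reference)


# Toy data, invented for illustration (assumption, not from the paper).
reference = {
    "diabète": {"糖尿病"},
    "insuline": {"インスリン", "インシュリン"},
    "glycémie": {"血糖値"},
}
candidates = {
    "diabète": ["糖尿病", "病気"],
    "insuline": ["注射", "インスリン"],
    "glycémie": ["血液"],
}

print(top_n_accuracy(candidates, reference, 1))
print(top_n_accuracy(candidates, reference, 2))
```

Under this assumed setup, comparing two corpora would amount to running the same mining chain on each and comparing the resulting scores at the same n against the same reference list.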
